Since Penrose proposed the sib-pair method in 1935 and the affected-sib-pair
(ASP) method in 1950 and 1953, these methods have been widely used in linkage
studies. Combined with contemporary DNA-level genetic marker technology, the ASP
method can now be used for genome-wide searches for disease genes.

A two-stage search involves first a screening search to eliminate nonviable marker
loci, then an intensive search to identify the gene location using the remaining markers.
Although this approach has been suggested in the literature, its properties have not
been thoroughly investigated. The spacing between markers, the number of ASPs to
be used in each stage, and the criteria for markers to pass the first stage need to be
determined. The major difficulties of the two-stage approach are that (1) the joint
distribution of the nonindependent test statistics is difficult to handle; (2) the
dependence structure of the test statistics in the second stage is "random"; and (3) for
most practically available sample sizes, the high number of ties in test scores makes
asymptotic approaches inappropriate. This paper aims to provide an approach to
constructing optimal designs for rare autosomal recessive and dominant diseases.




Power computations using the multinomial distribution, supported by simulation,
show that a two-stage design is better than a one-stage design in most cases,
but not always. Combining data from the two stages yields a small increase in accuracy.
The optimal designs for resource allocation in each of the two stages are obtained and
presented in tables.
CHAPTER 1
INTRODUCTION


Modern genetics began with the work of Gregor Mendel (1822-84), conducted
between 1856 and 1863, which formed the basis for his 1866 paper (Klug and Cummings,
1997). Mendel, an Austrian monk, performed a series of experiments on garden
peas. Based on the results of these experiments, he proposed a particulate
inheritance theory, which hypothesized that heritable biological characteristics are
carried and controlled by individual "units." We call these units genes today, but
Mendel called each unit a "Merkmal," the German word for "character" (Levine and
Miller, 1991).

In his garden pea experiments, Mendel studied seven characters: round or wrinkled
ripe seeds, yellow or green seed interiors, purple or white petals, inflated or pinched
ripe pods, green or yellow unripe pods, axial or terminal flowers, and long or short
stems (Griffiths et al., 1993). For each of these seven characters, Mendel obtained
pure lines of plants. "A pure line is a population [in which] all offspring produced by
selfing or crossing within the population show the same form of the character being
studied" (Griffiths et al., 1993). The parental generation in Mendel's experiments,
denoted F0, consisted of plants from these pure lines. Mendel first studied the
characters separately: for each character he crossed two pure lines, one for each
phenotype, to obtain the first filial generation, denoted F1. All F1 individuals showed
only one of the two phenotypes of each character. Next, Mendel self-fertilized the F1
plants and obtained the second filial generation, denoted F2. The results are
shown in Table 1.1. Mendel established two principles to explain the pattern of the


Table 1.1. Results of Mendel's crossing experiments.

Parental phenotype           F1             F2                            F2 ratio
Round x wrinkled seeds       All round      5474 round; 1850 wrinkled     2.96:1
Yellow x green seeds         All yellow     6022 yellow; 2001 green       3.01:1
Purple x white petals        All purple     705 purple; 224 white         3.15:1
Inflated x pinched pods      All inflated   882 inflated; 299 pinched     2.95:1
Green x yellow pods          All green      428 green; 152 yellow         2.82:1
Axial x terminal flowers     All axial      651 axial; 207 terminal       3.14:1
Long x short stems           All long       787 long; 277 short           2.84:1

Source: Griffiths et al., 1993, p. 23.

data  (Levine  and  Miller,  1991;  Griffiths  et  al,  1993): 

•  The characteristics of an organism are determined by individual units of heredity
called genes. Each adult organism has two alleles for each gene, one from
each parent. These alleles are segregated (separated) from each other when
reproductive cells are formed, so that each gamete receives one of the two alleles
with equal chance. This is the principle of segregation, also known as
Mendel's First Law.


•  In  an  organism  with  contrasting  alleles  for  the  same  gene,  one  allele  may  be 
dominant  over  another.  This  is  known  as  the  principle  of  dominance. 

He assumed that each plant had two alleles for each trait he studied. This assumption
was correct. "An allele is one of the different forms of a gene that can exist at a
single locus" (Griffiths et al., 1993, p. 783). He introduced the notation A for the
dominant allele and a for the recessive allele. Since the parental generations were
pure lines, their genotypes were homozygous AA and aa. Thus, the F1 plants
were all heterozygous, with genotype Aa. The F2, offspring of the F1, were expected to
be AA, Aa, and aa in the ratio 1:2:1, and hence to show the dominant (AA and Aa) and
recessive (aa) phenotypes in the ratio 3:1. Mendel cultivated 10 seeds from each of 100
dominant F2 plants; if all 10 offspring from a single F2 plant showed the dominant
character, he concluded that the plant was homozygous. Mendel used this experiment to
demonstrate the 1:2 ratio of homozygous dominant (AA) to heterozygous dominant (Aa)
F2 plants. However, a heterozygous F2 parent still has probability $(0.75)^{10} = 0.0563$
of producing 10 dominant offspring and hence of being misclassified as homozygous.
Therefore, the true expected ratio is 0.3709:0.6291 rather than 1:2. Fisher
suggested that Mendel's data are too close to 1:2 rather than to the correct values, and
thus suspected some manipulation or omission of data (Weir, 1996).
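The corrected ratio quoted above can be checked with a few lines of arithmetic. The following is a minimal sketch, assuming that one third of the dominant F2 plants are truly AA and two thirds are truly Aa, and that each selfed Aa plant produces dominant offspring independently with probability 0.75; the function names are illustrative and do not come from the cited sources.

```python
def misclassification_prob(n_offspring: int = 10) -> float:
    """P(a selfed heterozygous Aa plant yields only dominant offspring)."""
    return 0.75 ** n_offspring


def expected_classification(n_offspring: int = 10) -> tuple[float, float]:
    """Expected shares of dominant F2 plants classified as homozygous / heterozygous."""
    miss = misclassification_prob(n_offspring)
    p_true_hom, p_true_het = 1 / 3, 2 / 3
    # Every true AA plant passes the test; an Aa plant passes only by chance.
    classified_hom = p_true_hom + p_true_het * miss
    classified_het = p_true_het * (1 - miss)
    return classified_hom, classified_het


miss = misclassification_prob()
hom, het = expected_classification()
print(round(miss, 4), round(hom, 4), round(het, 4))
```

With 10 offspring per plant the sketch reproduces the figures in the text, 0.0563 for the misclassification probability and 0.3709:0.6291 for the expected classification ratio.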

Mendel also studied two characters together, for example the shape and the color of
the seeds. We denote the round allele as R and the wrinkled allele as r,
and the yellow allele as Y and the green allele as y. He crossed a parental line of plants
with yellow, round seeds (genotype RRYY) with a line with green, wrinkled seeds
(genotype rryy). The F1 seeds were all round and yellow (RrYy), showing that
the alleles R and Y are dominant over r and y. He then crossed the F1 plants and
obtained four types of seeds: 315 round/yellow, 108 round/green, 101 wrinkled/yellow,
and 32 wrinkled/green. The ratio was approximately 9:3:3:1 (Levine and Miller,
1991). All possible F2 genotype combinations are given in the following Punnett
square (Griffiths et al., 1993); the first row gives the genotype of the gamete from the
father, the first column gives the genotype of the gamete from the mother, and the
numbers in parentheses are the expected frequencies if the alleles of the two
characters segregate independently.


              RY (1/4)       Ry (1/4)       ry (1/4)       rY (1/4)
RY (1/4)      RRYY (1/16)    RRYy (1/16)    RrYy (1/16)    RrYY (1/16)
Ry (1/4)      RRYy (1/16)    RRyy (1/16)    Rryy (1/16)    RrYy (1/16)
ry (1/4)      RrYy (1/16)    Rryy (1/16)    rryy (1/16)    rrYy (1/16)
rY (1/4)      RrYY (1/16)    RrYy (1/16)    rrYy (1/16)    rrYY (1/16)

Mendel established the following principle to explain the pattern of the
data.

•  Each of the seven genes that Mendel investigated segregated independently
(independent assortment). This is also known as Mendel's Second Law
(Levine and Miller, 1991; McPeek, 1997).
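The 9:3:3:1 phenotype ratio implied by independent assortment can be recovered by enumerating the sixteen gamete combinations of the dihybrid cross. The sketch below assumes complete dominance of R over r and of Y over y and equal gamete frequencies of 1/4; the helper `phenotype` and the variable names are illustrative only. It also scales the expected fractions to Mendel's 556 seeds for comparison with the observed counts.

```python
from collections import Counter
from fractions import Fraction
from itertools import product

# Four gamete types from an RrYy parent, each with probability 1/4
# under independent assortment.
GAMETES = ["RY", "Ry", "rY", "ry"]


def phenotype(gamete_a: str, gamete_b: str) -> str:
    """Seed phenotype of the offspring formed from two gametes (complete dominance)."""
    shape = "round" if "R" in (gamete_a[0], gamete_b[0]) else "wrinkled"
    color = "yellow" if "Y" in (gamete_a[1], gamete_b[1]) else "green"
    return f"{shape}/{color}"


# Count phenotypes over the 16 equally likely cells of the Punnett square.
counts = Counter(phenotype(a, b) for a, b in product(GAMETES, repeat=2))
expected = {name: Fraction(n, 16) for name, n in counts.items()}

# Mendel's observed seed counts as quoted in the text.
observed = {
    "round/yellow": 315,
    "round/green": 108,
    "wrinkled/yellow": 101,
    "wrinkled/green": 32,
}
total = sum(observed.values())

for name, frac in expected.items():
    print(name, frac, round(float(frac) * total, 2), observed[name])
```

The enumeration gives 9, 3, 3, and 1 cells out of 16 for the four phenotypes, which is the ratio that the observed counts approximate.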

Regardless of the controversy over his data, Mendel's theory that genes control
biological characters is well accepted.

Correns (1900) observed the phenomenon of complete linkage, in which alleles
of two or more different characters appeared always to be inherited together rather
than independently, as Mendel's Second Law would predict (McPeek, 1997). This
violation led to an extension of Mendel's theory, the chromosome theory of heredity,
which says that genes are parts of specific cellular structures, the chromosomes
(Griffiths et al., 1993). "In 1902, Walter Sutton and Theodor Boveri, independently published
papers linking their discoveries of the behavior of chromosomes during meiosis to the
Mendel's principles of segregation and independent assortment. Sutton and Boveri
are credited with initiating the chromosomal theory of heredity" (Klug and Cummings,
1997, p. 61). This theory provided a physical mechanism for Mendel's Laws. It was
assumed that the characters Mendel studied lay on different chromosomes, and that
characters which were completely linked lay on the same chromosome (McPeek, 1997).

Bateson and Punnett conducted an experiment on sweet peas to study two characters:
flower color, purple (dominant) or red (recessive), and pollen form, long
(dominant) or round (recessive). First, they crossed purple, long (PPLL) plants with
red, round (ppll) plants to create heterozygous (PpLl) plants (the F1 generation); then
they crossed the F1 plants among themselves and produced 381 plants. The results are
shown in Table 1.2 (Levine and Miller, 1991, p. 193). These results did not match
Mendel's Second Law, nor were the two genes completely linked.

Table 1.2. The results of Bateson and Punnett's experiment.

Phenotype        Number      Percent     Percent expected    Percent expected
                 observed    observed    if unlinked         if completely linked
Purple, long     284         74%         56%                 75%
Purple, round     21          6%         19%                 -
Red, long         21          6%         19%                 -
Red, round        55         14%          6%                 25%

Source: Levine and Miller, 1991.

Morgan (1911) provided an explanation for Bateson and Punnett's observation and for
the similar results of his own fruit fly experiments. He suggested that an exchange of
genetic material had occurred between two homologous chromosomes when they paired
during cell division; this exchange was called a crossover (McPeek, 1997).
Recombination is the process by which progeny derive a combination of genes different
from that of either parent (DOE, 1992). Each individual inherits one chromosome of
each homologous pair from its father and the other from its mother. When this
individual forms gametes during meiosis, the chromosome it passes to an offspring is
generally not an intact copy of either chromosome it inherited but a "blended" one.
Meiosis produces crossovers that blend the two chromosomes and cause recombination.
If there is no crossover, or an even number of crossovers, between two loci, then
recombination does not occur between these two loci. If there is an odd number of
crossovers, then recombination occurs. The probability that recombination occurs
between two loci is called the recombination fraction. With current technology,
crossovers are not directly observable, but recombination between two genes (or
markers) is (Ott, 1991). An obvious consequence of the crossover process is that the
closer two genes are to each other on a chromosome, the smaller the chance that a
crossover occurs between them, and hence the smaller the chance that recombination
occurs between them. This idea became the foundation of linkage analysis.
Recombination is the chief source of variation among species (Rhodes et al., 1974).

The purpose of a linkage study is to find the relative locations of genes or
markers and finally to present them in a linkage map. The distance between two loci
on a chromosome is measured by genetic distance, which is discussed in §2.1,
"Map Function".

Lander and Schork (1994) identified four major categories of methods for genetic
dissection. One is "genetic analysis of large crosses in model organisms such as the
mouse and rat." The other three use human genetic data: pedigree analysis,
allele-sharing methods, and association studies in human populations. These are
discussed in more detail in §2.2, "An Introduction to Human Linkage Data for Genetic
Dissection". The affected-sib-pair method, proposed by Penrose (1953), is one of the
allele-sharing methods. Combined with modern DNA-level genetic marker technology,
such as restriction fragment length polymorphism (RFLP) markers (Botstein et al.,
1980), it can be used for genome-wide gene searches.

This dissertation discusses a two-stage genome-wide approach, based on the
affected-sib-pair method, to searching for disease genes. Only designs for rare
recessive and dominant diseases with a dichotomous phenotype (affected or not
affected) are studied. Chapter 2 provides a literature review of related topics,
including map functions, the sib-pair method, affected-sib-pair methods, identity by
descent, identity by state, the affected-relative-members method, heterogeneity, and
polygeneity. Chapter 3 discusses how to design an optimal two-stage search for a
simple Mendelian disease gene, and Chapter 4 discusses how to design an optimal
two-stage search for two low-penetrance, non-interacting, unlinked recessive genes for
a complex disease. Brief concluding remarks are given in the last chapter.


CHAPTER  2 
LITERATURE  REVIEW 


In  this  chapter,  the  following  topics  will  be  covered: 

•  map  function  (without  biological  interference), 

•  concepts  of  identical-by-descent  (IBD)  and  identical-by-state  (IBS), 

•  test statistics based on IBD scores, IBS scores, and the likelihood ratio,

•  sib-pair  method  and  affected-sib-pair  method, 

•  heterogeneity  and  homogeneity, 

•  polygeneity, 

•  two-stage  approach. 

2.1     Map  Function 

The genetic map distance between two genes, in units called Morgans, is defined
as the expected number of crossovers occurring between the two genes on a single
chromosome strand (Ott, 1991). When the distance is so small that the probability of
multiple crossovers is negligible, the genetic map distance is equivalent to the
recombination fraction. One centimorgan is a distance between two genes that produces
a probability of approximately 0.01 that recombination takes place between them. When
the distance is large, the probability of multiple crossovers increases. If there is zero
or an even number of crossovers, we will not observe recombination, but if there
is an odd number of crossovers we will. Hence, the recombination fraction is not
necessarily equal to the map distance when two loci are farther apart. A map function
is used to relate the additive, but hard to measure, genetic or map distance to the
non-additive, but more readily estimable, recombination fraction (Speed, 1997). A map
function should preserve the additivity of map distance, but not every map function
does. For example, consider three loci $x_1$, $x_2$, $x_3$ on a chromosome. Let
$\theta_{12}$ and $\theta_{23}$ be the map distances between $x_1$ and $x_2$ and between
$x_2$ and $x_3$, respectively, and let $r_{12}$ and $r_{23}$ be the corresponding
recombination fractions. If we assume that $x_1$, $x_2$, and $x_3$ are so close
that the chance of multiple crossovers between them is negligible, then the Morgan
map function will not preserve the additivity of distance, but the Haldane (1919) map
function will. For the Morgan map function, $r_{12} = \theta_{12}$ and
$r_{23} = \theta_{23}$. The map distance between $x_1$ and $x_3$, $\theta_{13}$, is equal
to $\theta_{12} + \theta_{23}$; hence the recombination fraction between $x_1$ and $x_3$,
$r_{13}$, is equal to $r_{12} + r_{23}$. On the other hand, since the chance of multiple
crossovers is negligible, and treating recombination in the two intervals as independent,

$$
\begin{aligned}
\Pr(\text{recombination between } x_1 \text{ and } x_3)
&= \Pr(\text{recombination between } x_1 \text{ and } x_2 \text{ but not between } x_2 \text{ and } x_3) \\
&\quad + \Pr(\text{recombination between } x_2 \text{ and } x_3 \text{ but not between } x_1 \text{ and } x_2) \\
&= \theta_{12}(1 - \theta_{23}) + (1 - \theta_{12})\theta_{23} \\
&= \theta_{12} + \theta_{23} - 2\theta_{12}\theta_{23} \\
&\neq r_{12} + r_{23}.
\end{aligned}
$$

This is a contradiction. For the Haldane map function, since the map distance is
$\theta_{12} + \theta_{23}$, $r_{13}$ is equal to $0.5\,(1 - e^{-2(\theta_{12} + \theta_{23})})$, and

$$
\begin{aligned}
\Pr(\text{recombination between } x_1 \text{ and } x_3)
&= 0.5\,(1 - e^{-2\theta_{12}})\,[1 - 0.5\,(1 - e^{-2\theta_{23}})]
 + 0.5\,(1 - e^{-2\theta_{23}})\,[1 - 0.5\,(1 - e^{-2\theta_{12}})] \\
&= 0.5\,\{(1 - e^{-2\theta_{12}}) + (1 - e^{-2\theta_{23}}) - (1 - e^{-2\theta_{12}})(1 - e^{-2\theta_{23}})\} \\
&= 0.5\,(1 - e^{-2(\theta_{12} + \theta_{23})}).
\end{aligned}
$$

The additivity of map distance is preserved.
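The algebra above can also be checked numerically. The following sketch assumes independent recombination in the two adjacent intervals, as in the derivation, and uses arbitrary illustrative distances of 0.10 and 0.15 Morgans; the function names are not taken from the cited literature.

```python
import math


def haldane(x: float) -> float:
    """Haldane map function: recombination fraction for map distance x (Morgans)."""
    return 0.5 * (1.0 - math.exp(-2.0 * x))


def morgan(x: float) -> float:
    """Morgan map function: recombination fraction equals map distance."""
    return x


def combine(r12: float, r23: float) -> float:
    """Recombination fraction across two adjacent intervals, assuming independence."""
    return r12 * (1.0 - r23) + (1.0 - r12) * r23


theta12, theta23 = 0.10, 0.15  # illustrative map distances

# Map the summed distance directly, or combine the two interval fractions.
haldane_direct = haldane(theta12 + theta23)
haldane_combined = combine(haldane(theta12), haldane(theta23))
morgan_direct = morgan(theta12 + theta23)
morgan_combined = combine(morgan(theta12), morgan(theta23))

print(haldane_direct, haldane_combined)  # the two values agree
print(morgan_direct, morgan_combined)    # differ by 2 * theta12 * theta23
```

For the Haldane function the two routes give the same recombination fraction, whereas for the Morgan function they differ by $2\theta_{12}\theta_{23}$, which is the discrepancy identified in the derivation.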

Also, "most map functions are determined by three loci comparisons which may
not be consistent in terms of four or more loci" (Liberman and Karlin, 1984). If a
map function is consistent for any number of loci, it is called multilocus-feasible.
Several counterexamples in Ott (1991, p. 127) show that a map function that is not
multilocus-feasible can yield a negative probability.

Liberman and Karlin (1984) investigated these problems. Their conclusions can be
summarized as follows. There are two methods for constructing a genetic map function:
the first starts with a model of the recombination process, and the second uses a
differential equation. "The renewal crossover formation process model assumes that
crossovers occur in succession along the chromosome, starting at a natural biological
site (e.g., the centromere, some origin of replication, or a telomere) such that the
lengths of the intervals between successive crossovers are positive random variables
independently and identically distributed."

For the first method, if a renewal crossover formation model is specified, the only
feasible map function is the Haldane map function (Haldane, 1919). The corresponding
crossover formation process is a Poisson process.

Theorem 1 Consider a crossover formation model in the form of a renewal process
with intercrossover distribution $F(t)$, the distribution function of the distance between
two successive crossovers, such that $F^{(n)}(0) \neq 0$ for some positive integer $n$. Then a
map function exists if and only if the intercrossover distribution agrees with that of
an exponential distribution, i.e., the renewal process is a Poisson process (Liberman
and Karlin, 1984).
On the issue of multilocus feasibility of map functions, they derived necessary and
sufficient conditions for multilocus-feasibility.

Theorem 2 Let $r$ be the recombination fraction and $x$ the genetic distance. Sufficient
conditions: A map function $r = M(x)$ is multilocus-feasible if its derivatives
$M^{(n)}$ obey the inequalities

$$(-1)^n M^{(n)}(x) < 0, \qquad n = 1, 2, \ldots, \text{ for all } x > 0.$$

Necessary conditions: Let $r = M(x)$ be a map function that is multilocus-feasible
and suppose that all derivatives $M^{(n)}(x)$ exist. Then

$$(-1)^n M^{(n)}(0) < 0, \qquad \forall\, n = 1, 2, \ldots.$$
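As an illustration of how the conditions of Theorem 2 are applied, the following worked check uses the Haldane and Kosambi map functions in their standard forms, $r = \tfrac{1}{2}(1 - e^{-2x})$ and $r = \tfrac{1}{2}\tanh(2x)$. It is a sketch added for clarity, not a derivation taken from Liberman and Karlin (1984).

```latex
% Haldane: every derivative has a closed form, so the sufficient condition
% can be verified for all n and all x > 0.
\begin{align*}
M(x) &= \tfrac{1}{2}\left(1 - e^{-2x}\right), &
M^{(n)}(x) &= (-1)^{n+1}\, 2^{\,n-1} e^{-2x}, \\
(-1)^n M^{(n)}(x) &= -\,2^{\,n-1} e^{-2x} < 0, & n &= 1, 2, \ldots
\end{align*}

% Kosambi: expand around x = 0 and inspect the third derivative.
\begin{align*}
M(x) &= \tfrac{1}{2}\tanh(2x) = x - \tfrac{4}{3}x^{3} + O(x^{5}), &
M^{(3)}(0) &= -8, \\
(-1)^{3} M^{(3)}(0) &= 8 > 0 .
\end{align*}
```

The Haldane function therefore satisfies the sufficient condition, while the Kosambi function violates the necessary condition at $n = 3$ and so cannot be multilocus-feasible.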

Table 2.1 shows the multilocus feasibility of several map functions, where $r$ is the
recombination fraction and $x$ is the genetic map distance. Note that the
multilocus-feasible map functions are those constructed by the crossover formation
process method. Further extensions incorporate genetic interference and are not
discussed here; see Karlin and Liberman (1994) and Speed (1997).
Most map functions constructed by a differential equation method are not
multilocus-feasible and are also not discussed.

2.2     An  Introduction  to  Human  Linkage  Data  for  Genetic  Dissection 

The purpose of a linkage study is to locate the gene of interest.
With contemporary DNA-level genetic marker technology (Botstein et al., 1980), one
way to achieve this is by studying the linkage between genes and markers. "When two
genes  are  inherited  independently  of  each  other,  recombinants  and  nonrecombinants 
are  expected  in  equal  proportions  among  the  offspring.  For  some  pairs  of  genes,  one 
observes  a  consistent  deviation  from  the  1:1  ratio  of  recombinant  to  nonrecombinant 
Table 2.1. Multilocus-feasibility of several map functions.

Source                        Map function r = M(x)                                              Multilocus-feasible
Haldane (1919)                $r = \tfrac{1}{2}(1 - e^{-2x})$                                     Yes
Ludwig (1934)                 $r = \tfrac{1}{2}\sin(2x)$                                          No
Kosambi (1944)                $r = \tfrac{1}{2}\tanh(2x)$                                         No
Carter and Falconer (1951)    $x = \tfrac{1}{4}[\tan^{-1}(2r) + \tanh^{-1}(2r)]$                  No
Sturt (1976)                  $r = \tfrac{1}{2}[1 - (1 - x/L)\, e^{-x(2L-1)/L}]$                  Yes
Rao et al. (1977)             $x = \tfrac{1}{6}[\,p(2p-1)(1-4p)\ln(1-2r)$                         No
                              $\quad + 16p(1-p)(2p-1)\tan^{-1}(2r)$
                              $\quad + 2p(1-p)(8p+2)\tanh^{-1}(2r)$
                              $\quad + 6(1-p)(1-2p)(1-4p)\,r\,]$
Felsenstein (1979)            $r = \dfrac{1 - e^{2(K-2)x}}{2\,(1 - (K-1)\,e^{2(K-2)x})}$          No

Source: Liberman and Karlin, 1984.
offspring ... In other words, alleles of different genes appear to be genetically
coupled, and this phenomenon is called genetic linkage." (Ott, 1991, p. 6) A
marker is a small, identifiable physical region on a chromosome whose inheritance
can be monitored (DOE, 1992). The closer a marker is to the gene, the
smaller the chance that they will be separated by recombination during meiosis.
Therefore, if a marker is strongly correlated with the phenotype of a gene, the gene
should be in proximity to the marker. There are three major categories of methods
using human genetic data: pedigree analysis, allele-sharing methods, and association
studies. Risch and Merikangas (1996) summarized these methods as follows:

In pedigree analysis, we first collect pedigrees that contain affected members. Next
we propose a genetic model (location of the gene, allele frequency, mode of inheritance,
and so on), say $M_1$, to explain the pattern observed in the data, and compare the
likelihood under $M_1$ with the likelihood under the null hypothesis $M_0$, which assumes
no gene in the region of the marker, by the likelihood ratio

$$\frac{L(\text{data} \mid M_0)}{L(\text{data} \mid M_1)},$$

or equivalently by the logarithm of the odds (LOD) score (Barnard, 1949),

$$\log_{10} \frac{L(\text{data} \mid M_1)}{L(\text{data} \mid M_0)}.$$

Since this method requires specifying a genetic model, it is mostly used to identify
genes for simple Mendelian traits.
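To make the LOD score concrete, the sketch below evaluates it in the simplest textbook setting, which is an assumption made here for illustration and is not the pedigree likelihood itself: $n$ phase-known, fully informative meioses of which $k$ are recombinant, so that the likelihood under $M_1$ is $\theta^k (1-\theta)^{n-k}$ and the null model $M_0$ corresponds to free recombination, $\theta = 0.5$. The function name `lod_score` is hypothetical.

```python
import math


def lod_score(n_meioses: int, n_recombinants: int, theta: float) -> float:
    """Base-10 log odds of linkage at recombination fraction theta versus theta = 0.5.

    Assumes phase-known, fully informative meioses (an illustrative simplification).
    """
    if not 0.0 < theta <= 0.5:
        raise ValueError("theta must lie in (0, 0.5]")
    if not 0 <= n_recombinants <= n_meioses:
        raise ValueError("n_recombinants must lie between 0 and n_meioses")
    k, n = n_recombinants, n_meioses
    # Work on the log scale to avoid underflow for large n.
    log_l1 = k * math.log10(theta) + (n - k) * math.log10(1.0 - theta)
    log_l0 = n * math.log10(0.5)
    return log_l1 - log_l0


# One recombinant among ten meioses, evaluated at theta = 0.1.
example = lod_score(10, 1, 0.1)
print(round(example, 3))
```

In practice the score is evaluated over a grid of $\theta$ values and the maximum is reported; real pedigree data require the full likelihood under the proposed genetic model rather than this binomial simplification.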

In allele-sharing methods, we first collect relative pairs or groups (most often
affected relatives) and try to show that the pattern observed in the data is not due
to random Mendelian segregation. Most of these methods use identical-by-descent
scores, that is, the number of marker alleles that relatives share as copies of the
same ancestral allele. The major advantage is that no specific genetic model is
required to carry out the test, so these are considered nonparametric methods. They
are mostly used for detecting genes for non-Mendelian complex traits.

An association study is a case-control study. Instead of using familial data, it
collects unrelated affected and unaffected individuals and compares the allele
frequencies of the candidate genes or markers between the two groups. If a disease
gene is one of the candidate genes, or is very strongly associated with some marker
alleles, then the affected group will have higher frequencies of those alleles than the
control group. This method is mostly used after the possible region of the genes has
been narrowed down, because it involves examining genes and/or markers on a very fine
scale, i.e., examining many markers in a small area of a chromosome. "Association studies
seem to be of greater power than linkage studies. But of course, the limitation of
association studies is that the actual gene or genes involved in the disease must be
tentatively identified before the test can be performed. . . . Thus, the primary
limitation of genome-wide association tests is not a statistical one but a technological
one" (Risch and Merikangas, 1996, p. 1517).

Allele-sharing methods include the affected-sib-pair (ASP) method and its
variants. Penrose is considered the first person to propose the sib-pair method
(Conneally and Rivas, 1980; Shah and Green, 1994). In 1935, Penrose proposed a simple
$\chi^2$ test to detect linkage between two characters. In 1938, he extended this method
to "graded" characters, that is, characters that can take intermediate values.
In 1950, Penrose proposed the concept of the affected-sib-pair method as a means of
analyzing red hair phenotype and ABO blood group data. He wrote: "Accuracy is
preserved and uninformative dead wood excluded if a set of sibships is selected by the
presence of one of the test characters . . ." A general form of the sib-pair method was
proposed by Penrose in 1953. These methods have since been extended to examine
identical-by-descent scores (elaborated in the following sections), identical-by-state
scores of markers, and the risk ratio between two relatives. The idea of the
affected-sib-pair method is to include in the study only pairs of sibs who are both
affected. If the disease is caused by a gene, then both affected sibs tend to receive the
same disease allele. Hence, if a marker is close to the gene, the marker will tend to
segregate with the gene during meiosis, and the affected sibs will tend to share marker
alleles as well. How to detect the linkage is covered in later sections.

The  major  advantages  of  affected-sib-pair  method  over  pedigree  analysis  were 
summarized  by  Holmans  and  Craddock  in  1997: 

•  The affected-sib-pair method does not require specification of a genetic model,
which is important for complex diseases where the mode of inheritance is unclear.

•  It is generally easier to collect affected sibling data than to collect a large,
multigeneration pedigree with multiple affected members.

•  Affected sib-pairs are more likely to be informative for linkage than large
pedigrees under oligogenic epistatic models, which are plausible for a number of
complex traits.

The disadvantage of the affected-sib-pair method is that it is considered less powerful
than traditional pedigree analysis when the genetic model can be specified. Moreover,
without a specified genetic model, the affected-sib-pair method cannot estimate the
recombination fraction.

Suarez, Rich, and Reich (1978) and Blackwelder and Elston (1985) suggested that
sampling only affected sib-pairs is more powerful than sampling affected-unaffected
sib-pairs under a fixed sample size constraint. Some sampling issues in linkage studies,
as summarized by Gershon et al. (1994), appear in Table 2.2.
Table 2.2. Some sampling issues in linkage studies.

Issue                    Pro                                    Con
Very large pedigree      Statistical power high under           Extension may actually introduce
                         homogeneity assumption.                heterogeneity. Very hard to find.
Medium-size pedigree     Less likely to have heterogeneity      Heterogeneity between pedigrees.
series                   within pedigrees.
Nuclear families         More likely to get generalizable       Lower power: very many needed if
                         results.                               heterogeneity present.
Affected sib-pairs       Model-free                             Even lower statistical power.

Source: Gershon et al., 1994.


2.3     Identical  by  State,  Identical  by  Descent 


IBS stands for "identical by state": two alleles are the same, regardless of whether
they are copies of the same ancestral allele. IBD stands for "identical by descent":
two alleles are said to be IBD if they are copies of the same ancestral allele. Since
humans are diploid, that is, have two haploid sets of chromosomes, two sibs can share
2, 1, or 0 marker alleles at a locus. Consequently, their IBD scores would be 2, 1, or
0, respectively. The distribution of the IBD score is discussed in the next section. We
consider two alleles identical by descent only if they are copies of the same ancestral
allele in the pedigree under study, i.e., if the alleles are traceable.
With the affected-sib-pair method, the sib-pairs are selected for study on the
condition that both sibs are affected by the same disease (biological character).
If the disease is caused by a gene, then a marker adjacent to the disease gene has
a higher probability of segregating with the gene during meiosis. Hence, this marker
tends to have a higher IBD score, and the method can serve as a tool for locating
trait genes.

2.4     Distributions  of  IBD 

Under ideal conditions, the affected-sib-pair method uses IBD scores to make inferences about the recombination fraction between the trait gene(s) and the marker(s). Hence it is important to know the distribution of the IBD scores. According to Mendel's First Law, each of a parent's two haplotypes has an equal chance of being passed to an offspring. Therefore, if sibs are not selected conditionally on any biological character, the probability that a pair of full sibs has an IBD score of 2 is 0.25, of 1 is 0.5, and of 0 is 0.25. If sibs are selected conditionally on a biological character, and the marker(s) examined are unlinked to the trait gene(s) responsible for that character, then, regardless of the inheritance mode, penetrance, and allele frequency of the trait gene, the probability mass function is again 0.25, 0.5, and 0.25 for IBD equal to 2, 1, and 0, respectively. Therefore, if the observed IBD scores deviate from this mass function, that is evidence that they do not come from a random (unlinked) model. Nevertheless, the distribution of the IBD score under specified conditions is still worth investigating, especially for developing more powerful test statistics or for power analysis.
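The null mass function 0.25, 0.5, 0.25 can be checked by brute-force enumeration. The following is a minimal sketch, assuming each parent carries two distinguishable haplotypes (labelled 0 and 1 here purely for illustration) and that the four meioses leading to the two sibs are independent and fair; the function name is hypothetical.

```python
from fractions import Fraction
from itertools import product


def null_ibd_distribution():
    """Enumerate the equally likely transmissions to two full sibs.

    Each child receives one of two haplotypes (0 or 1) from the father and
    one of two from the mother. The IBD score of the pair is the number of
    parents from whom both sibs received the same haplotype.
    """
    counts = {0: 0, 1: 0, 2: 0}
    # Four independent meioses: father->sib1, mother->sib1,
    # father->sib2, mother->sib2.
    for f1, m1, f2, m2 in product((0, 1), repeat=4):
        score = int(f1 == f2) + int(m1 == m2)
        counts[score] += 1
    total = sum(counts.values())
    return {score: Fraction(n, total) for score, n in counts.items()}


dist = null_ibd_distribution()
for score in (2, 1, 0):
    print(f"Pr(IBD = {score}) = {dist[score]}")
```

Because no selection on phenotype enters the enumeration, the result applies only to the unlinked (or unselected) case described above.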

Li and Sacks (1954) developed a method that uses stochastic matrices to derive the joint distribution of the genotypes of two relatives. They demonstrated the method with an example: a gene with two alleles, say A and a, with population frequencies $p$ and $q = 1 - p$. First, obtain a conditional probability matrix $P = \{p_{ij}\}$, with rows indexed by the genotype of relative 1 and columns by the genotype of relative 2, both in the order AA, Aa, aa:

$$P = \begin{pmatrix} p_{11} & p_{12} & p_{13} \\ p_{21} & p_{22} & p_{23} \\ p_{31} & p_{32} & p_{33} \end{pmatrix},$$

where, conditional on relative 1 having AA, $p_{11}$ is the probability that relative 2 has AA, $p_{12}$ that relative 2 has Aa, and $p_{13}$ that relative 2 has aa, and so on. Once such a matrix is obtained, the absolute frequencies of all combinations of genotypes of relative pairs are easily obtained by multiplying each row by the corresponding genotype frequency of relative 1. In their example, that means multiplying the first row, $(p_{11}, p_{12}, p_{13})$, by $p^2$, the second row by $2pq$, and the third row by $q^2$. Li and Sacks' method uses three basic transition probability matrices, $I$, $T$, and $O$, to construct $P$. The matrices $I$, $T$, and $O$ are defined in the same way as $P$, except that they are conditional on the relative pair having both, one, or no genes identical by descent, respectively. One can see that, in their example,


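$$I = \{i_{ij}\} = \begin{pmatrix} 1 & 0 & 0 \\ 0 & 1 & 0 \\ 0 & 0 & 1 \end{pmatrix}, \quad T = \{t_{ij}\} = \begin{pmatrix} p & q & 0 \\ \tfrac{1}{2}p & \tfrac{1}{2} & \tfrac{1}{2}q \\ 0 & p & q \end{pmatrix}, \quad O = \{o_{ij}\} = \begin{pmatrix} p^2 & 2pq & q^2 \\ p^2 & 2pq & q^2 \\ p^2 & 2pq & q^2 \end{pmatrix},$$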

where $i = 1, 2, 3$ and $j = 1, 2, 3$. For the matrix $I$: given that relatives 1 and 2 share two genes identical by descent, the probability that relative 2 has genotype AA, given that relative 1 has genotype AA, is equal to 1, so $i_{11} = 1$. For the same reason, the diagonal elements $i_{ii}$ of $I$ are equal to one and the off-diagonal elements are equal to zero. Now consider the matrix $T$. Given that the two relatives share only one gene IBD, if relative 1 has genotype AA and relative 2 also has genotype AA, relative 2 must have received one of his A alleles from a source other than the common ancestor from whom relative 1 received his A allele. Since the population frequency of A is $p$, $t_{11} = p$. The other elements of $T$ and $O$ can be obtained by the same reasoning.

They showed several examples of how to use $I$, $T$, and $O$ to construct $P$. For the parent-offspring relationship the transition matrix is $P = T$; the grandparent-grandchild pair has transition matrix

$$P = T^2 = \tfrac{1}{2} T + \tfrac{1}{2} O,$$

and for relatives of the general parent-offspring type,

$$P = T^{n+1} = \left(\tfrac{1}{2}\right)^{n} T + \left(1 - \left(\tfrac{1}{2}\right)^{n}\right) O,$$

where $n + 1$ is the number of generations between the two relatives. For full sibs, the matrix is

$$P = S = \tfrac{1}{4} I + \tfrac{1}{2} T + \tfrac{1}{4} O.$$
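The identities above are easy to verify with exact arithmetic. The sketch below builds the I, T, and O matrices for the two-allele example with genotype order AA, Aa, aa and checks the grandparent-grandchild and full-sib constructions; the helper names (`li_sacks_matrices`, `matmul`, `weighted_sum`, `full_sib_matrix`) and the allele frequency 3/10 are illustrative choices, not part of Li and Sacks' paper.

```python
from fractions import Fraction


def li_sacks_matrices(p):
    """Return (I, T, O) for a two-allele locus with Pr(A) = p."""
    q = 1 - p
    half = Fraction(1, 2)
    I = [[1, 0, 0], [0, 1, 0], [0, 0, 1]]
    T = [[p, q, 0], [half * p, half, half * q], [0, p, q]]
    O = [[p * p, 2 * p * q, q * q] for _ in range(3)]
    return I, T, O


def matmul(A, B):
    """Product of two 3x3 matrices stored as nested lists."""
    return [[sum(A[i][k] * B[k][j] for k in range(3)) for j in range(3)]
            for i in range(3)]


def weighted_sum(terms):
    """Sum of coefficient * matrix over a list of (coefficient, matrix)."""
    return [[sum(c * M[i][j] for c, M in terms) for j in range(3)]
            for i in range(3)]


def full_sib_matrix(p):
    """S = I/4 + T/2 + O/4, the full-sib transition matrix."""
    I, T, O = li_sacks_matrices(p)
    return weighted_sum([(Fraction(1, 4), I),
                         (Fraction(1, 2), T),
                         (Fraction(1, 4), O)])


p = Fraction(3, 10)
I, T, O = li_sacks_matrices(p)
half = Fraction(1, 2)
# Grandparent-grandchild: T squared equals T/2 + O/2.
print(matmul(T, T) == weighted_sum([(half, T), (half, O)]))
# Every row of a transition matrix must sum to one.
print(all(sum(row) == 1 for row in full_sib_matrix(p)))
```

Using `Fraction` rather than floating point makes the matrix equalities exact, so the comparison needs no tolerance.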

Campbell and Elston (1971) extended Li and Sacks' method to derive the transition probability matrices for different modes of inheritance and multiple loci. Haseman and Elston (1972) constructed the joint probability matrix for sibs. Risch (1990b) and Bishop and Williamson (1990) summarized this work and gave a table of the trait IBD distribution conditional on marker IBD and relationship for sibs, grandparent-grandchild, uncle-nephew, half-sibs, and first cousins. It is adapted here in Table 2.3.

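Table 2.3. The values of Pr(IBD of trait gene | IBD of marker and relationship).

Relationship             IBD of marker   IBD of trait = 2    IBD of trait = 1      IBD of trait = 0
Sibs                     2               $\Psi^2$            $2\Psi(1-\Psi)$       $(1-\Psi)^2$
                         1               $\Psi(1-\Psi)$      $1-2\Psi+2\Psi^2$     $\Psi(1-\Psi)$
                         0               $(1-\Psi)^2$        $2\Psi(1-\Psi)$       $\Psi^2$

Relationship             IBD of marker   IBD of trait = 1                                      IBD of trait = 0
Grandparent-grandchild   1               $1-\theta$                                            $\theta$
                         0               $\theta$                                              $1-\theta$
Uncle-nephew             1               $\Psi(1-\theta)+\tfrac{1}{2}\theta$                   $1-\tfrac{1}{2}\theta-\Psi(1-\theta)$
                         0               $1-\tfrac{1}{2}\theta-\Psi(1-\theta)$                 $\Psi(1-\theta)+\tfrac{1}{2}\theta$
Half-sibs                1               $\Psi$                                                $1-\Psi$
                         0               $1-\Psi$                                              $\Psi$
First cousins            1               $\Psi(1-\theta)^2+\tfrac{1}{2}\theta^2$               $1-\tfrac{1}{2}\theta^2-\Psi(1-\theta)^2$
                         0               $\tfrac{1}{3}\{1-\tfrac{1}{2}\theta^2-\Psi(1-\theta)^2\}$   $\tfrac{1}{3}\{2+\Psi(1-\theta)^2+\tfrac{1}{2}\theta^2\}$

Where $\theta$ is the recombination fraction between the marker and trait gene, $\Psi = \theta^2 + (1-\theta)^2$.

Holmans (1993) showed that, for an affected sib pair, the possible values of the probability mass function $p = (p_0, p_1, 1 - p_0 - p_1)$ of marker IBD scores 0, 1, and 2 are restricted, for all genetic models, to what he called "the possible triangle", which is the intersection of $p_0 \ge 0$, $p_1 \le \tfrac{1}{2}$, and $p_1 \ge 2p_0$. His proof is included here.

Let $p_i$ and $z_i$ be the probabilities that a sib pair shares $i$ alleles IBD at the marker locus and at the trait locus respectively, let $\theta$ be the recombination fraction between the marker and the trait gene, and let $\Psi = \theta^2 + (1-\theta)^2$. Then

$$\begin{aligned}
p_0 &= \Psi^2 z_0 + \Psi(1-\Psi) z_1 + (1-\Psi)^2 z_2, \\
p_1 &= 2\Psi(1-\Psi) z_0 + [\Psi^2 + (1-\Psi)^2] z_1 + 2\Psi(1-\Psi) z_2, \\
p_2 &= (1-\Psi)^2 z_0 + \Psi(1-\Psi) z_1 + \Psi^2 z_2.
\end{aligned}$$

From the second equation,

$$p_1 = 2\Psi(1-\Psi)(z_0 + z_1 + z_2) + [\Psi^2 + (1-\Psi)^2 - 2\Psi(1-\Psi)] z_1 = 2\Psi(1-\Psi) + (2\Psi-1)^2 z_1.$$

Since $\tfrac{1}{2} \le \Psi \le 1$, and it can be shown that $z_1 \le \tfrac{1}{2}$,

$$p_1 \le 2\Psi(1-\Psi) + (2\Psi-1)^2/2 = \tfrac{1}{2}.$$

From the first and second equations,

$$\begin{aligned}
p_1 - 2p_0 &= (2\Psi - 4\Psi^2) z_0 + (2\Psi-1)^2 z_1 + \{(1-\Psi)[2\Psi - 2(1-\Psi)]\} z_2 \\
&= (2\Psi-1)^2 z_1 + 2(1-\Psi)(2\Psi-1)(1 - z_1 - z_0) - 2\Psi(2\Psi-1) z_0 \\
&\ge (2\Psi-1)^2 z_1 + 2(1-\Psi)(2\Psi-1) - 3(1-\Psi)(2\Psi-1) z_1 - \Psi(2\Psi-1) z_1 \\
&= 2(1-\Psi)(2\Psi-1)(1 - 2z_1) \\
&\ge 0,
\end{aligned}$$

where the first inequality uses $2z_0 \le z_1$ at the trait locus and the last uses $z_1 \le \tfrac{1}{2}$.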

This proves the "possible triangle" restriction. Holmans also proposed a procedure for finding the maximum likelihood estimator under this restriction. The procedure is as follows:

1. Obtain $\hat z = (\hat z_0, \hat z_1, 1 - \hat z_0 - \hat z_1)$ by the unrestricted method, that is, the sample frequencies.

2. If $\hat z_1 > \tfrac{1}{2}$, remaximize $\hat z$ subject to the constraint $z_1 = \tfrac{1}{2}$. If the resultant $\hat z_0 > \tfrac{1}{4}$, then reset $\hat z$ to the null value (0.25, 0.5, 0.25).

3. If $2\hat z_0 > \hat z_1$, remaximize $\hat z$ subject to the constraint $z_1 = 2z_0$. If the resultant $\hat z_0 > \tfrac{1}{4}$, then reset $\hat z$ to the null value.

4. If $\hat z$ is in the triangle, leave it as it is.

This concludes the procedure.
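The restriction and the four-step procedure can be sketched in a few lines. The code below is an illustration under stated assumptions, not Holmans' implementation: it assumes the IBD score of every pair is observed without ambiguity, so the likelihood is multinomial in the counts (n0, n1, n2), and the closed-form constrained maxima in steps 2 and 3 are derived here for that multinomial case only. All function names are hypothetical.

```python
def marker_ibd_probs(z, theta):
    """Map trait-locus IBD probabilities z = (z0, z1, z2) to the marker locus."""
    z0, z1, z2 = z
    psi = theta ** 2 + (1 - theta) ** 2
    p0 = psi ** 2 * z0 + psi * (1 - psi) * z1 + (1 - psi) ** 2 * z2
    p1 = (2 * psi * (1 - psi) * z0
          + (psi ** 2 + (1 - psi) ** 2) * z1
          + 2 * psi * (1 - psi) * z2)
    return p0, p1, 1 - p0 - p1


def in_possible_triangle(p, tol=1e-12):
    """Check p0 >= 0, p1 <= 1/2 and p1 >= 2*p0, up to rounding error."""
    p0, p1, _ = p
    return p0 >= -tol and p1 <= 0.5 + tol and p1 >= 2 * p0 - tol


def restricted_ibd_mle(n0, n1, n2):
    """Restricted estimate of (z0, z1, z2) from counts of IBD = 0, 1, 2."""
    n = n0 + n1 + n2
    if n == 0:
        raise ValueError("at least one sib pair is required")
    null = (0.25, 0.5, 0.25)
    # Step 1: unrestricted estimate, the sample frequencies.
    z0, z1 = n0 / n, n1 / n
    # Step 2: on the edge z1 = 1/2 the multinomial likelihood is
    # proportional to z0**n0 * (1/2 - z0)**n2.
    if z1 > 0.5:
        z1 = 0.5
        # With n0 + n2 == 0 the likelihood is flat; 0.25 is an arbitrary pick.
        z0 = 0.5 * n0 / (n0 + n2) if n0 + n2 > 0 else 0.25
        if z0 > 0.25:
            return null
    # Step 3: on the edge z1 = 2*z0 the likelihood is proportional to
    # z0**(n0 + n1) * (1 - 3*z0)**n2.
    if 2 * z0 > z1:
        z0 = (n0 + n1) / (3 * n)
        z1 = 2 * z0
        if z0 > 0.25:
            return null
    # Step 4: the estimate is now inside the triangle.
    return (z0, z1, 1 - z0 - z1)


print(restricted_ibd_mle(10, 70, 20))
print(in_possible_triangle(marker_ibd_probs((0.0, 0.0, 1.0), 0.1)))
```

When IBD status is only partially informative, the constrained maximization has no such closed form and would need a numerical optimizer instead.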

Suarez, Rice, and Reich (1978) derived a generalized sib-pair IBD score distribution in which the IBD score distribution is conditioned on the number of affected sibs for a trait with two alleles. This conditional distribution showed that "the distribution of IBD scores depends only on the additive and dominance variances and the population prevalence of the disorder" (p. 94).

One problem with IBD scores is that, in order to establish them, the marker must be highly polymorphic. "Often, identity by descent cannot unequivocally be established" (Ott, 1991, p. 79). To overcome this problem, Lange (1986a) proposed an affected-sib-set method based on IBS (identical by state), and at the same time he extended the affected-sib-pair method to the affected-sib-set (ASS) method (Lange, 1986b). He proposed a test statistic for the affected sibs in a single nuclear family,

$$Z = \sum_{i<j} X_{ij}, \tag{2.1}$$

$$X_{ij} = \begin{cases} 1 & i \text{ and } j \text{ concordant (i.e., IBS = 2)}, \\ \tfrac{1}{2} & i \text{ and } j \text{ half-concordant (i.e., IBS = 1)}, \\ 0 & i \text{ and } j \text{ discordant (i.e., IBS = 0)}, \end{cases} \tag{2.2}$$

where $i$ and $j$ are indexes of sib-set members. Lange noted that since they "will ultimately apply the Central Limit Theorem, it suffices to derive the mean and variance of Z for a particular affected sib set." This ASS method uses population allele frequencies to calculate the probabilities of all possible parental mating types and then, conditional on the parental mating type, calculates the probability of IBS of the sib pair. Lange gave the mean and variance of $Z$ "in the limiting case of an infinite number of alleles, each of infinitesimal frequency," (this author believes that means very highly polymorphic) as follows,

$$E(Z) = s(s-1)/4, \tag{2.3}$$

$$\mathrm{Var}(Z) = s(s-1)/16, \tag{2.4}$$

where $s$ is the size of the sib set. He then presented a way to combine the $Z$ statistics from different sib sets,

$$T = \frac{\sum_s \sum_r w_{rs}\,(Z_{rs} - E(Z_{rs}))}{\sqrt{\sum_s \sum_r w_{rs}^2\,\mathrm{Var}(Z_{rs})}},$$

where $Z_{rs}$ denotes the $Z$ statistic for the $r$th affected sib set of size $s$, and $w_{rs}$ is a weight. Regarding the choice of $w_{rs}$, he claimed that if the weights depend only on $s$ and not on $r$, and if the number of sib sets is large, then $T$ should approximately follow a standard normal distribution. Based on Hodge's (1984) results, Lange recommended using the weights,


$$w_{rs} = \frac{\sqrt{s-1}}{\sqrt{\mathrm{Var}(Z_{rs})}}.$$
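As a small illustration of the sib-set statistic, the sketch below scores every pair of affected sibs by IBS status (1 for concordant, 1/2 for half-concordant, 0 for discordant), sums the scores to obtain Z, and evaluates the limiting mean s(s-1)/4 and variance s(s-1)/16. Genotypes are represented as unordered pairs of allele labels; that representation, the function names, and the example sib set are assumptions made for this example only.

```python
from itertools import combinations


def ibs_count(g1, g2):
    """Number of alleles (0, 1 or 2) two unordered genotypes share in state."""
    a, b = g1
    c, d = g2
    return max(int(a == c) + int(b == d), int(a == d) + int(b == c))


def sib_set_z(genotypes):
    """Z statistic: sum over sib pairs of IBS/2 (1, 1/2 or 0 per pair)."""
    return sum(ibs_count(g1, g2) / 2 for g1, g2 in combinations(genotypes, 2))


def limiting_mean_and_variance(s):
    """E(Z) and Var(Z) in the highly polymorphic limit for a sib set of size s."""
    return s * (s - 1) / 4, s * (s - 1) / 16


affected_sibs = [(1, 2), (1, 2), (1, 3)]  # hypothetical marker genotypes
z = sib_set_z(affected_sibs)
mean, variance = limiting_mean_and_variance(len(affected_sibs))
print(z, mean, variance)
```

The limiting moments apply only when alleles are rare enough that IBS effectively coincides with IBD; for markers with common alleles the mean and variance depend on the allele frequencies through the mating-type calculation described above.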


Weeks and Lange (1988) generalized the ASS method to pedigrees, calling the refined method the affected-pedigree-member (APM) method. In this method only those pedigree members who are both affected and typed at the marker locus enter the definition of the test statistic. They modified the above test statistic $T$ in two ways. First, they modified the $Z$ statistic to be

$$Z_{ij} = \tfrac{1}{4}\delta(G_{ix}, G_{jx}) + \tfrac{1}{4}\delta(G_{ix}, G_{jy}) + \tfrac{1}{4}\delta(G_{iy}, G_{jx}) + \tfrac{1}{4}\delta(G_{iy}, G_{jy}),$$

where $i$ and $j$ index the affected members of a pedigree, $G_{ix}$ and $G_{jx}$ are the maternal marker alleles of members $i$ and $j$ respectively, $G_{iy}$ and $G_{jy}$ are the paternal marker alleles of members $i$ and $j$ respectively, and

$$\delta(G, G') = \begin{cases} 1 & G \text{ and } G' \text{ match in state}, \\ 0 & G \text{ and } G' \text{ do not match in state}. \end{cases}$$

They claim this modification "permits computation of the theoretical mean and variance of the new test statistic for each pedigree by taking advantage of the theory of multiple-person ibd relation." Second, they introduced a new weighting factor

$$\frac{\sqrt{r_m - 1}}{\sqrt{\mathrm{Var}(Z_m)}},$$

where $r_m$ is the number of affected and typed individuals in the $m$th pedigree, and $Z_m$ is the $Z$ statistic of the $m$th pedigree. Later, Weeks and Lange (1992) proposed a multilocus extension of this APM method.

Bishop and Williamson (1990) studied the power of IBS methods on affected relative pairs and showed that several factors have a major influence on the power: the relationship of the affected pair, the polymorphism of the marker, the recombination distance between the trait locus and the closest marker, and the mode of inheritance of the trait.

2.5     Risk  Ratio 

In 1990, Risch published a series of papers addressing risk ratio methods. In the first of this series (1990a), a risk ratio, $\lambda_R$, is defined as the ratio of the probability of being affected, given that a type $R$ relative is affected, to the population prevalence. Let the relationship subscripts be as follows: M = monozygotic twin; S = sibling (or dizygotic twin); O = parent (or offspring); 2 = second-degree relative; 3 = third-degree relative. For a single-locus model, two recurrence-risk patterns were derived:

$$\lambda_O - 1 = 2(\lambda_2 - 1) = 4(\lambda_3 - 1) \tag{2.6}$$

and

$$\lambda_M = 4\lambda_S - 2\lambda_O - 1. \tag{2.7}$$

Similar results were obtained for multiplicative, additive, and genetic heterogeneity two-locus models and a general multilocus model. Important conclusions are: "for a single-locus model, $\lambda_R - 1$ decreases by a factor of two with each degree of relationship. . . . For a multiplicative (epistasis) model, $\lambda_R - 1$ decreases more rapidly than by a factor of two with each degree of relationship. Examination of $\lambda_R$ values for various classes of relatives can potentially suggest the presence of multiple loci and epistasis".

In the second paper (1990b), Risch derived the probabilities of IBD scores for a completely polymorphic marker for several types of relatives under the assumption that the recurrence-risk pattern (2.6) holds. According to the first paper of the series, this assumption might be violated if multiple contributing loci are present. The IBD distributions are summarized in Table 2.4. He concluded that, "for diseases with large $\lambda$ values and for small $\theta$ value, distant relatives offer greater power. For large $\theta$ values, grandparent-grandchild pairs are best; for small $\lambda$ values, sibs are best." The third paper (1990c) took into account the effect of marker polymorphism on power. There were some errors in the third paper, and Risch wrote a response to address and correct them (Risch, 1992). The feasibility of his method depends on whether a suitable control group is included in the study or an appropriate estimate of the general population risk can be obtained.
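A short numerical sketch may help to see how the risk ratios drive the IBD distribution. It evaluates the single-locus patterns (2.6) and (2.7) and the affected-sib-pair rows of Table 2.4, read here as Pr(IBD = 2) = 1/4 + (2Ψ-1)[(λ_S-1) + 2Ψ(λ_S-λ_O)]/(4λ_S), Pr(IBD = 1) = 1/2 - (2Ψ-1)²(λ_S-λ_O)/(2λ_S), and Pr(IBD = 0) = 1/4 - (2Ψ-1)[(λ_S-1) + 2(1-Ψ)(λ_S-λ_O)]/(4λ_S), with Ψ = θ² + (1-θ)². The function names and the input values are illustrative assumptions.

```python
def single_locus_ratios(lam_o, lam_s):
    """Risk ratios implied by Eqs. (2.6) and (2.7) for a single-locus model."""
    return {
        "second_degree": 1 + (lam_o - 1) / 2,
        "third_degree": 1 + (lam_o - 1) / 4,
        "mz_twin": 4 * lam_s - 2 * lam_o - 1,
    }


def asp_marker_ibd(lam_s, lam_o, theta):
    """(Pr(IBD=0), Pr(IBD=1), Pr(IBD=2)) at a marker for an affected sib pair."""
    psi = theta ** 2 + (1 - theta) ** 2
    c = 2 * psi - 1
    p2 = 0.25 + c / (4 * lam_s) * ((lam_s - 1) + 2 * psi * (lam_s - lam_o))
    p1 = 0.5 - c ** 2 / (2 * lam_s) * (lam_s - lam_o)
    p0 = 0.25 - c / (4 * lam_s) * ((lam_s - 1) + 2 * (1 - psi) * (lam_s - lam_o))
    return p0, p1, p2


print(single_locus_ratios(lam_o=5.0, lam_s=5.0))
for theta in (0.0, 0.1, 0.5):
    print(theta, asp_marker_ibd(lam_s=4.0, lam_o=4.0, theta=theta))
```

At θ = 0.5 the function returns the null frequencies regardless of the risk ratios, and at θ = 0 the probability of IBD = 0 reduces to 1/(4λ_S), which is a convenient sanity check on the transcription of the table.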

Table 2.4. The probability of IBD score for different relatives.

Relative pair                  Marker IBD   Probability
Affected sib pairs             2            $\tfrac{1}{4} + \tfrac{1}{4\lambda_S}(2\Psi-1)[(\lambda_S-1) + 2\Psi(\lambda_S-\lambda_O)]$
                               1            $\tfrac{1}{2} - \tfrac{1}{2\lambda_S}(2\Psi-1)^2(\lambda_S-\lambda_O)$
                               0            $\tfrac{1}{4} - \tfrac{1}{4\lambda_S}(2\Psi-1)[(\lambda_S-1) + 2(1-\Psi)(\lambda_S-\lambda_O)]$
Grandparent-grandchild pairs   1            $\tfrac{1}{2} + \tfrac{1}{2}(1-2\theta)\tfrac{1}{\lambda_O+1}(\lambda_O-1)$
                               0            $\tfrac{1}{2} - \tfrac{1}{2}(1-2\theta)\tfrac{1}{\lambda_O+1}(\lambda_O-1)$
Uncle(Aunt)-Nephew(Niece)      1            $\tfrac{1}{2} + \tfrac{1}{2}(1-\theta)(1-2\theta)^2\tfrac{1}{\lambda_O+1}(\lambda_O-1)$
                               0            $\tfrac{1}{2} - \tfrac{1}{2}(1-\theta)(1-2\theta)^2\tfrac{1}{\lambda_O+1}(\lambda_O-1)$
Half-sib pairs                 1            $\tfrac{1}{2} + \tfrac{1}{2}(2\Psi-1)\tfrac{1}{\lambda_O+1}(\lambda_O-1)$
                               0            $\tfrac{1}{2} - \tfrac{1}{2}(2\Psi-1)\tfrac{1}{\lambda_O+1}(\lambda_O-1)$
First-cousin pairs             1            $\tfrac{1}{4} + [(1-\theta)^4 + \theta^2(1-\theta)^2 + \tfrac{1}{2}\theta^2 - \tfrac{1}{4}]\tfrac{1}{\lambda_O+3}(\lambda_O-1)$
                               0            $\tfrac{3}{4} - [(1-\theta)^4 + \theta^2(1-\theta)^2 + \tfrac{1}{2}\theta^2 - \tfrac{1}{4}]\tfrac{1}{\lambda_O+3}(\lambda_O-1)$

Where $\theta$ is the recombination fraction between gene and marker, $\Psi = \theta^2 + (1-\theta)^2$.

2.6     Test Statistics Based on IBD Score

Several test statistics based on the IBD scores of the sib-pair method are available for detecting linkage. One is the Two-allele (proportion) test statistic, which is the number (or proportion) of sib pairs that share two alleles, proposed by Day and Simons (1976) and Suarez, Rice, and Reich (1978). The second, the Mean test statistic, is the sum (or sample mean) of the IBD scores, proposed by Green and Woodrow (1977). The third applies the Pearson chi-square goodness-of-fit test (Pearson, 1900) to the distribution of IBD scores. All these tests assume that the IBD score frequencies are 0.25, 0.5, and 0.25 under unlinked conditions, and use the deviation of the observed IBD scores from these frequencies to detect linkage.
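The three statistics can be computed directly from the counts of sib pairs with IBD score 0, 1, and 2. The sketch below standardizes the Two-allele and Mean statistics by their null mean and variance so that they can be compared with a standard normal; that standardization, the function name, and the example counts are assumptions of this illustration rather than the exact forms used in the cited papers.

```python
import math


def ibd_test_statistics(n0, n1, n2):
    """Return (two_allele_z, mean_z, pearson_chi2) for IBD counts n0, n1, n2."""
    n = n0 + n1 + n2
    # Two-allele test: the count of IBD = 2 pairs is Binomial(n, 1/4) under H0.
    two_allele_z = (n2 - n / 4) / math.sqrt(3 * n / 16)
    # Mean test: the sum of IBD scores has mean n and variance n/2 under H0.
    mean_z = (n1 + 2 * n2 - n) / math.sqrt(n / 2)
    # Pearson goodness of fit against the null frequencies (1/4, 1/2, 1/4).
    expected = (n / 4, n / 2, n / 4)
    chi2 = sum((obs - exp) ** 2 / exp
               for obs, exp in zip((n0, n1, n2), expected))
    return two_allele_z, mean_z, chi2


two_allele_z, mean_z, chi2 = ibd_test_statistics(20, 50, 30)
print(round(two_allele_z, 3), round(mean_z, 3), round(chi2, 3))
```

The first two statistics are one-sided (large positive values suggest excess sharing), whereas the Pearson statistic is two-sided with 2 df and does not use the direction of the deviation.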

Several papers discuss the merits of these test statistics. The Two-allele test statistic proposed by Day and Simons (1976) was based on the supposition that "the probability of sharing both haplotypes deviates more from the probability under the null hypothesis than does the probability of having one, or at least one, haplotype in common. One would test for the existence of the DS (disease susceptibility) gene by comparing the observed frequency of having both haplotypes in common." Suarez, Rice, and Reich (1978) suggested pooling the categories IBD=0 and IBD=1, based on their generalized sib-pair IBD distribution mentioned in §2.4 "Distributions of IBD". They demonstrated that affected sib pairs sharing two alleles carry the most information, noting that "with minimal loss of information, we can pool categories Pr{IBD=0,1}."

Green and Woodrow (1977) used the total number of "repeats," i.e., the total of the IBD scores, as the test criterion. They also suggested using only affected sib pairs because the distribution of IBD scores is symmetric under the null hypothesis, so the sum of "repeats" approaches a normal distribution more quickly than under other sampling schemes.

Blackwelder and Elston (1985) examined the Two-allele test statistic, the Mean test statistic, and the goodness-of-fit test statistic by calculating the exact probabilities. They concluded that the Mean test statistic is generally more powerful than either of the other two. Schaid and Nick (1990) proposed a test statistic that is a linear combination of the number of marker alleles IBD. If the alternative IBD distribution can be specified, then this test statistic can be optimized. They also suggested a $t_{max}$ test statistic, the maximum of the Mean test statistic and the Two-allele test statistic, and showed that $t_{max}$ is more robust in the sense that, if the mode of inheritance (recessive, dominant, and so on) is misspecified, the loss of power may not be too great.

Faraway (1993) proposed a modified chi-square test. This test is a special case of a general method, the $\chi^2$ test with restricted alternatives, given by Lehmann (1986, p. 480). Faraway showed that it was more powerful than the Two-allele, Mean, Pearson chi-square, and Schaid and Nick's $t_{max}$ tests for finite samples, and claimed that it was the asymptotically uniformly most powerful invariant test. In more detail, consider testing the hypothesis

$$H_0: p_0 = p_2 = 0.25,\ p_1 = 0.5 \quad \text{vs} \quad H_1: p_0 + p_1 + p_2 = 1.$$

Let the chi-square goodness-of-fit test statistic be

$$Y = 4n(\hat p_0 - 0.25)^2 + 2n(\hat p_1 - 0.5)^2 + 4n(\hat p_2 - 0.25)^2,$$

where $\hat p_i$ is the sample frequency of IBD = $i$ and $n$ is the total sample size. Faraway showed that the possible values of the IBD score distribution, $p_i$, $i = 0, 1, 2$, are restricted to a region $F$ that is the intersection of $p_1 + p_2 \le 1$, $p_1 \le 1/2$, and $3p_1/2 + p_2 \ge 1$, and claimed that this region is equivalent to the "possible triangle" in Holmans (1993). He then showed that, under this restriction, the $(\tilde p_0, \tilde p_1, \tilde p_2)$ that maximizes the statistic $Y$ actually turns $Y$ into one of the Two-allele test statistic, the Mean test statistic, $Y$ itself, or 0. The results are as follows.

Region                               Test statistic
$F$                                  $Y$
$2p_2 + p_1 > 1,\ p_1 > 1/2$         Mean
$3p_1/2 + p_2 < 1,\ p_2 > 1/4$       Two-allele
$2p_2 + p_1 < 1,\ p_2 < 1/4$         0

All these discussions of the merits of different test statistics were based on a model in which the trait locus has only two alleles. Since the distribution of the IBD score depends on the penetrances of the genotypes as well as on their frequencies, whether these results can be extended to a model that assumes three or more alleles for a disease susceptibility gene needs further investigation.

Based on the same two-allele model and on Suarez, Rice, and Reich's (1978) generalized sib-pair IBD score distribution, Knapp, Seuchter, and Baur (1994a) showed that if $f_1^2 = f_0 f_2$, the Mean test is the uniformly most powerful test in $\theta$ (the recombination fraction) regardless of the allele frequency of t, where $f_0$ is the penetrance of the genotype tt (t is the susceptibility allele and T is the normal allele), $f_1$ is that of the genotype Tt, and $f_2$ is that of the genotype TT. It is clear that recessive diseases ($f_1 = f_2 = 0$) satisfy this condition, with either complete or incomplete penetrance ($f_0 = 1$ or $f_0 < 1$). The authors also proved that the Mean test is the locally most powerful test (there is no test with larger power for any alternative within a neighborhood of $H_0$, according to Rao, 1973), irrespective of the mode of inheritance. They stated, "Because uniform optimality implies local optimality, no other test than the mean test can be uniformly most powerful." In another paper (1994b), Knapp, Seuchter, and Baur presented the equivalence of the mean test and the parametric maximum LOD score analysis with an assumed recessive mode of inheritance.

2.7     Likelihood Ratio Test

The generalized likelihood ratio test has been commonly used in genetics for detecting linkage and heterogeneity. The LOD, an acronym for "logarithm of the odds ratio" (Barnard, 1949), is a special version of the likelihood ratio. However, the classical results on the likelihood ratio test cannot be applied directly. We first review the relevant theorem.

In his book "Approximation Theorems of Mathematical Statistics" (1980, p. 144), Serfling states the following:

Regularity Conditions on $F$. Consider $\Theta$ to be an open interval (not necessarily finite) in $R$. We assume:

(R1) For each $\theta \in \Theta$, the derivatives . . . ,

where $F$ is the family of distribution functions, and $\Theta$ is the parameter space of $F$. Assuming the regularity conditions, it can be shown that the maximum likelihood estimator of $\theta$ has an asymptotic normal distribution (Serfling, 1980, p. 145). With this property and Lemma A (Serfling, p. 153), the likelihood ratio test theorem (Serfling, p. 155) can be proven:

Consider testing $H_0$: $\theta = \theta_0$, where $\theta_0$ is a $k$-dimensional vector in $\Theta$. Let

$$\Lambda_n = \frac{L(\theta_0)}{\sup_{\theta \in \Theta} L(\theta)},$$

where $n$ is the sample size. Then, under $H_0$, the statistic $-2 \ln \Lambda_n$ converges in distribution to $\chi^2_k$ (Serfling, 1980).

The purpose of the regularity conditions is to ensure the existence of the Taylor expansion of the distribution functions (Serfling, p. 145). The parameter space was assumed to be an open interval in $R$; in fact, if $\theta_0$ is an interior point of the parameter space $\Theta$, the theorem holds regardless of whether the parameter space is an open interval or not.

Let

$$\mathrm{LOD} = \log_{10} \frac{\max_{0 \le \theta \le 0.5} L(\theta)}{L(\theta = 0.5)}. \tag{2.8}$$

If the genetic model is known or is assumed to be known, some authors use the LOD to test

$$H_0: \theta = 0.5 \quad \text{vs} \quad H_1: \theta < 0.5,$$

where $\theta$ is the recombination fraction between a marker and a gene, and claim that $2\ln(10) \times \mathrm{LOD} \sim \chi^2_1$ (Chotai, 1984; Elston, 1994; Shute, 1988). However, this is incorrect because $\theta_0 = 0.5$ is on the boundary of the parameter space [0, 0.5], so the maximum likelihood estimator does not have an asymptotic normal distribution. Similar errors occur when testing heterogeneity, where the parameter space is [0, 1] and the null hypothesis of monogeneity, i.e., $\alpha = 1$, is on the boundary of the parameter space (Ott, 1983), where $\alpha$ is the proportion of families belonging to the linked group. Ott (1991) attempted to correct this error, arguing that the LOD test is a "one-sided" test. The correct asymptotic distribution of the maximum likelihood ratio when the null parameter value is on the boundary of the parameter space was given by Self and Liang (1987). The asymptotic distribution of $2\ln(10) \times \mathrm{LOD}$, with LOD as in Eq. (2.8), is a 50:50 mixture of a $\chi^2_1$ distribution and a point mass at 0.

To make inferences, a critical value of the LOD score needs to be decided. For a simple Mendelian disease, the conventional criterion for claiming linkage is LOD > 3 (Morton, 1955, 1956). The probability of obtaining a LOD > 3 under the null hypothesis is about 0.0001. One reason for such a low significance level is that the prior probability of the gene being within a certain distance of the marker is small; another is that we would rather have no linkage map than a wrong linkage map (Morton, 1955, 1956). Ott (1994) gave a Bayesian argument, asking us to consider two hypotheses, $H_0$: free recombination or absence of linkage ($\theta = 0.5$), and $H_1$: $\theta < 0.5$. The "prior" probability that a gene and a marker are within a measurable distance (40 cM) is small; based on Elston and Lange's 1975 result, he took the probability of a marker and a gene being within 40 cM to be 0.02. The posterior odds for linkage are

$$\frac{P(H_1 \mid data)}{P(H_0 \mid data)} = \left(\frac{P(data \mid H_1)}{P(data \mid H_0)}\right) \left(\frac{P(H_1)}{P(H_0)}\right).$$

With LOD = 3, the first term on the right-hand side is about 1,000:1 and the second term is about 1:50, so the posterior odds for linkage are about 20:1, that is, the posterior odds against linkage are equal to 0.05. Since a multiple-gene effect cannot be ruled out for a complex trait, he added one more hypothesis, the "other" hypothesis, $H_2$. The prior probabilities are given as follows, with $h$ being the prior probability of $H_2$.


Hypothesis                        Prior probability
$H_0$: Single gene, no linkage    $0.98(1-h)$
$H_1$: Single gene, linkage       $0.02(1-h)$
$H_2$: "Other"                    $h$

The posterior odds are equal to

$$\frac{P(H_1 \mid data)}{P(\bar{H}_1 \mid data)} = \left(\frac{P(data \mid H_1)}{P(data \mid H_0)}\right) \frac{P(H_1)}{P(H_0) + \frac{P(data \mid H_2)}{P(data \mid H_0)} P(H_2)},$$

where $\bar{H}_1$ means "not $H_1$." Let $R = P(data \mid H_2)/P(data \mid H_0)$; then the posterior odds $Q$ are

$$Q = \left(\frac{P(data \mid H_1)}{P(data \mid H_0)}\right) \frac{0.02(1-h)}{0.98(1-h) + hR}.$$

If the disease in the family has a prior chance of 90% of being due to the "other" mode of inheritance, then the critical value has to increase from 3 to 4 in order to retain the posterior odds against linkage at 0.05.
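The two numerical claims in this section, a significance level of about 0.0001 at LOD = 3 and the shift of the critical value from 3 to 4 when h = 0.9, can be reproduced with a few lines. This is a sketch under stated assumptions: the tail probability uses the 50:50 mixture of a chi-square with 1 df and a point mass at 0, and the posterior odds set R = 1 (data equally likely under "other" and under no linkage), a value not stated in the text and chosen here only because it reproduces the quoted threshold. Function and parameter names are hypothetical.

```python
import math


def lod_tail_probability(lod):
    """Asymptotic Pr(LOD >= lod) under H0 for the 50:50 mixture distribution."""
    if lod <= 0:
        return 1.0
    x = 2 * math.log(10) * lod
    # For one degree of freedom, Pr(chi-square > x) = erfc(sqrt(x / 2)).
    chi2_tail = math.erfc(math.sqrt(x / 2))
    return 0.5 * chi2_tail


def posterior_odds_for_linkage(lod, h=0.0, r=1.0, prior_linkage=0.02):
    """Posterior odds Q of linkage, treating 10**lod as the likelihood ratio.

    h is the prior probability of the "other" hypothesis and r is
    P(data | H2) / P(data | H0); r = 1.0 is an assumption of this sketch.
    """
    likelihood_ratio = 10 ** lod
    numerator = prior_linkage * (1 - h)
    denominator = (1 - prior_linkage) * (1 - h) + h * r
    return likelihood_ratio * numerator / denominator


print(lod_tail_probability(3.0))
print(posterior_odds_for_linkage(3.0))
print(posterior_odds_for_linkage(3.0, h=0.9))
print(posterior_odds_for_linkage(4.0, h=0.9))
```

Posterior odds of about 20 for linkage correspond to odds of 0.05 against it, so comparing the second and fourth printed values shows the effect of the "other" hypothesis on the required LOD.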

Ott's Bayesian argument above applies to a single-marker situation. In a genome-wide search, many markers are used and multiple tests are performed. While the overall false positive rate increases because of the multiple tests, Ott (1991) indicated that, because the prior probability also increases with the number of markers, the critical limit could remain at 3. He also reported that Lander and Botstein (1989) investigated the problem and concluded that, for the human genome, the appropriate critical level of LOD to keep the overall significance level at 5% is between 2 and 3. Later, Lander and colleagues (Lander and Schork, 1994; Lander and Kruglyak, 1995) strongly advocated higher LOD thresholds for genome-wide detection, with results based on dense markers. This author has not tried to verify their "dense markers" approach and hence will not include their results here, but would like to point out that the distribution of the LOD differs between methods. Thus, different thresholds should be applied to different methods, as Lander and Schork and Lander and Kruglyak did.

In practice, it seems that most researchers have forgotten the other part of Morton's suggestion, namely that the marker should be excluded when LOD < -2. This author has not seen a published paper that uses the method of exclusion.

Clerget-Darpoux, Bonaiti-Pellie, and Hochez (1986) studied the effects of misspecified genetic parameters in LOD score analysis. They reported that "the power of the linkage test is sensitive to the degree of dominance, and slightly to the penetrance, but not to the gene frequency. In contrast, the estimation of the recombination fraction may be strongly affected by an error on any genetic parameter." MacLean et al. (1993) proposed a "MOD" method, which is similar to the LOD method but maximizes the LOD not only over the recombination fraction but also over the inheritance mode.

2.8     Heterogeneity  and  Homogeneity 

There are two types of heterogeneity in linkage analysis. One is allelic and the other is nonallelic; the latter is also referred to as locus heterogeneity. "With allelic heterogeneity, individuals differ from each other by having different alleles at the same locus responsible for the disease; in nonallelic heterogeneity, however, the disease is caused by different loci" (Ott, 1991). We discuss only nonallelic heterogeneity in this section, because it gives rise to more than one recombination fraction, which linkage analysis can detect.

There are several methods for testing homogeneity, including a method proposed by Morton (1956) and a method by Smith (1963). Morton's method tests whether all the families have the same recombination fraction or have different recombination fractions. He proposed a simple statistic,

$$2\ln(10) \times \left[\sum_{i=1}^{k} Z_i(\hat\theta_i) - Z(\hat\theta)\right], \tag{2.9}$$

where $\hat\theta_i$ is the recombination fraction that maximizes the LOD score of the $i$th family, and $Z_i(\hat\theta_i)$ denotes the maximized LOD score, $i = 1$ to $k$. $Z(\hat\theta)$ is the maximum of the total LOD score, attained at a single value $\hat\theta$ for all families combined. This statistic is assumed to have a $\chi^2$ distribution with $k - 1$ df.
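To make statistic (2.9) concrete, the sketch below evaluates it for a deliberately simple likelihood: each family contributes n phase-known, fully informative meioses of which r are recombinant, so the family LOD is log10 of θ^r (1-θ)^(n-r) / 0.5^n. That binomial model, the restriction of θ to [0, 0.5], the function names, and the example counts are all assumptions of this illustration; real pedigree likelihoods are more involved.

```python
import math


def family_lod(r, n, theta):
    """LOD score for r recombinants among n informative meioses at theta."""
    log_lik = 0.0
    if r > 0:
        log_lik += r * math.log10(theta)
    if n - r > 0:
        log_lik += (n - r) * math.log10(1 - theta)
    return log_lik - n * math.log10(0.5)


def max_lod(r, n):
    """Maximized LOD and its maximizer, with theta restricted to [0, 0.5]."""
    theta_hat = min(r / n, 0.5)
    return family_lod(r, n, theta_hat), theta_hat


def morton_statistic(families):
    """Statistic (2.9) and its nominal degrees of freedom, k - 1.

    families is a list of (recombinants, meioses) pairs, one per family.
    """
    separate = sum(max_lod(r, n)[0] for r, n in families)
    r_all = sum(r for r, _ in families)
    n_all = sum(n for _, n in families)
    pooled, _ = max_lod(r_all, n_all)
    return 2 * math.log(10) * (separate - pooled), len(families) - 1


stat, df = morton_statistic([(1, 10), (4, 10)])
print(round(stat, 3), df)
```

Because the pooled maximum is taken over a single θ, the sum of separately maximized LOD scores can never be smaller than the pooled maximum in this model, so the statistic is non-negative.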

Smith (1963) assumed that there were two groups, a linked group and an unlinked
group. The linked group is the group of families that have a disease gene at a locus
linked with the markers. The members of the unlinked group either have no gene that
caused the disease or have a gene at a locus that is unlinked to the markers. Smith
proposed a test of whether all families in the study are from the linked
group. If $\epsilon$ is the true but unknown proportion of families belonging to the linked
group with recombination fraction $\theta$, let $\hat\epsilon$ and $\hat\theta$ be the MLEs of $\epsilon$ and $\theta$. Under $H_1$,
the LOD score of the $i$th family is given as $Z_i(\epsilon, \theta) = \log[\epsilon L_i(\theta) + (1 - \epsilon) L_i(\tfrac{1}{2})]$. The
total LOD score is equal to $Z(\epsilon, \theta) = \sum_i Z_i(\epsilon, \theta)$. He assessed the significance of
nonallelic heterogeneity by the statistic

$$2\ln(10) \times \left[Z(\hat\epsilon, \hat\theta \mid H_1) - Z(\epsilon = 1, \hat\theta \mid H_0)\right]. \tag{2.10}$$

If $H_0$ is true, because $\epsilon = 1$ is on the boundary of the parameter space, the statistic follows
a 50:50 mixture of a chi-square distribution with 1 df and a point mass at
0, not a chi-square distribution with 1 df as originally suggested.
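
To make the boundary effect concrete, the following sketch computes a p-value for statistic (2.10) under the 50:50 mixture described above. The function names are illustrative assumptions; the chi-square tail with 1 df is obtained from the standard normal tail through the complementary error function.

```python
import math

def chi2_1df_sf(x):
    """P(chi-square with 1 df > x), using P(|Z| > sqrt(x)) for standard normal Z."""
    return math.erfc(math.sqrt(x / 2.0))

def mixture_p_value(stat):
    """p-value under 0.5 * (point mass at 0) + 0.5 * chi-square(1 df)."""
    if stat <= 0.0:
        return 1.0
    return 0.5 * chi2_1df_sf(stat)
```

Under this mixture, an observed statistic has half the p-value it would have under a plain chi-square distribution with 1 df, so using the plain chi-square reference makes the test conservative.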

Ott (1983) compared Smith's method, which he called the A-test, and Morton's
method, which he called the PS-test (M-test in his 1991 book). For a mixed situation
(families with or without linkage between the gene locus and the marker locus), Ott compared
the M-test with the A-test and found that the A-test was generally superior.
However, since he erred on the distribution of the test statistic (2.10), his conclusion
might need modification.

Chakravarti, Badner, and Li (1987) proposed a method to test
linkage and homogeneity using IBD scores in the case of an autosomal recessive gene.
The test was basically a goodness-of-fit test. They concluded that while the power of
their method for detecting linkage from sib-pair data is excellent, that for detecting
the heterogeneity of linkage is not. Procedural details are as follows.

For an autosomal recessive disease, the genotypes of unaffected parents in multiplex
families are $Dd \times Dd$, where $D$ and $d$ are the normal and disease alleles, respectively.
Among the affected sib-pairs, let the probabilities of marker IBD = 2, 1, and 0 be

$$P_2 = \Pr(\mathrm{IBD} = 2) = y^2, \tag{2.11}$$

$$P_1 = \Pr(\mathrm{IBD} = 1) = 2xy, \tag{2.12}$$

$$P_0 = \Pr(\mathrm{IBD} = 0) = x^2, \tag{2.13}$$

where $x = 2\theta(1 - \theta) \le 0.5$ and $y = 1 - x$. The maximum likelihood estimator (MLE)
for $x$ is

$$\hat{x} = \frac{n_1 + 2n_0}{2n}, \tag{2.14}$$

where $n_i$ is the number of affected sib-pairs sharing $i$ markers IBD, and $n = \sum_i n_i$.
Because the true value should not be greater than 0.5, they proposed a new estimator,
$u$, where

$$u = \begin{cases} \hat{x} & \text{if } \hat{x} \le 0.5, \\ 0.5 & \text{if } \hat{x} > 0.5. \end{cases}$$

This new estimator has a smaller variance than $\hat{x}$ under both the null and alternative hypotheses.
The recombination value may be estimated by

$$\hat\theta = \frac{1 - \sqrt{1 - 2u}}{2}.$$
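
The estimators above can be sketched in a few lines. This is an illustration under the definitions of (2.14) and the truncated estimator $u$; the function name and the example counts are hypothetical.

```python
import math

def ibd_estimators(n2, n1, n0):
    """Return (x_hat, u, theta_hat) from counts of sib-pairs with IBD = 2, 1, 0."""
    n = n2 + n1 + n0
    if n == 0:
        raise ValueError("no sib-pairs")
    x_hat = (n1 + 2 * n0) / (2 * n)      # Eq. (2.14)
    u = min(x_hat, 0.5)                  # truncate at the null value 0.5
    # invert x = 2*theta*(1 - theta) on 0 <= theta <= 0.5
    theta_hat = (1.0 - math.sqrt(1.0 - 2.0 * u)) / 2.0
    return x_hat, u, theta_hat

# Illustrative counts only
x_hat, u, theta_hat = ibd_estimators(50, 40, 10)
```

Here `theta_hat` is the smaller root of $2\theta(1-\theta) = u$, so that $u = 0.5$ maps to $\hat\theta = 0.5$ (no linkage).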


Under genetic heterogeneity, the IBD score distribution will be a mixture of two
binomial distributions under $\theta < 0.5$ and $\theta = 0.5$ in the proportions $c : 1 - c$. Thus
the $P_i$'s in (2.11) through (2.13) become

$$P_2 = c(1 - x)^2 + (1 - c)/4, \tag{2.15}$$

$$P_1 = 2cx(1 - x) + (1 - c)/2, \tag{2.16}$$

$$P_0 = cx^2 + (1 - c)/4. \tag{2.17}$$

Solving the above equations, they got $c = (P_2 - P_0)^2 / (P_2 - P_1 + P_0)$. Replacing $P_i$ with its MLE, they
obtained the MLEs for $c$ and $x$ under heterogeneity,

$$\hat{c} = \frac{(n_2 - n_0)^2}{n(n_2 - n_1 + n_0)}, \tag{2.18}$$

$$\hat{x}_h = \frac{n_1 - 2n_0}{2(n_2 - n_0)}. \tag{2.19}$$
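
As a check on (2.18) and (2.19), the sketch below computes the two estimates from IBD counts. The function name is hypothetical, and the example counts are constructed so that they equal their expectations under (2.15) through (2.17) with $c = 0.5$, $x = 0.1$, and $n = 100$; they are not data from the cited study.

```python
def heterogeneity_estimates(n2, n1, n0):
    """Return (c_hat, x_h_hat) from Eqs. (2.18) and (2.19)."""
    n = n2 + n1 + n0
    denom = n2 - n1 + n0
    if n2 == n0 or denom == 0:
        raise ValueError("estimates are undefined for these counts")
    c_hat = (n2 - n0) ** 2 / (n * denom)
    x_h_hat = (n1 - 2 * n0) / (2 * (n2 - n0))
    return c_hat, x_h_hat

# Expected counts for c = 0.5, x = 0.1, n = 100: P2 = 0.53, P1 = 0.34, P0 = 0.13
c_hat, x_h_hat = heterogeneity_estimates(53, 34, 13)
```

With these counts the function recovers the values of $c$ and $x$ used to construct them. The sketch does not force the estimates into the admissible ranges, which sampling variation can violate.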

Then, they proposed a two-stage method to test linkage and heterogeneity. First
they tested linkage; if the null hypothesis of no linkage was rejected, they then tested for
heterogeneity. In the first stage, the statistic

$$T = 2\sqrt{2n}\,\left(\hat{x} - \tfrac{1}{2}\right)$$

was used, and the authors claimed that $T$ has an asymptotic normal distribution. They
did not explain why $u$ was not used instead.
The statistic

$$G = \frac{n_2^2}{n(1 - x)^2} + \frac{n_1^2}{2nx(1 - x)} + \frac{n_0^2}{nx^2} - n \tag{2.20}$$

is used to test heterogeneity. The authors claimed that $G$ is asymptotically distributed
as a $\chi^2$ variable with 1 df. However, they neither proved this nor pointed out which value
should be used for $x$, $\hat{x}$ or $\hat{x}_h$. Monte Carlo simulations were used to evaluate the
power.
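
The two statistics can be sketched as follows. Because the source does not say which estimate of $x$ enters $G$, the sketch takes `x` as an explicit argument; the function names and example counts are hypothetical.

```python
import math

def linkage_T(n2, n1, n0):
    """First-stage statistic T = 2*sqrt(2n)*(x_hat - 1/2)."""
    n = n2 + n1 + n0
    x_hat = (n1 + 2 * n0) / (2 * n)
    return 2.0 * math.sqrt(2.0 * n) * (x_hat - 0.5)

def gof_G(n2, n1, n0, x):
    """Goodness-of-fit statistic of Eq. (2.20) for a supplied value of x."""
    n = n2 + n1 + n0
    return (n2 ** 2 / (n * (1 - x) ** 2)
            + n1 ** 2 / (2 * n * x * (1 - x))
            + n0 ** 2 / (n * x ** 2)
            - n)

# Illustrative counts only
T_null = linkage_T(25, 50, 25)          # counts exactly at the null proportions
G_example = gof_G(50, 40, 10, 0.3)
```

Written this way, $G$ is the usual Pearson form $\sum O^2/E - n$ with expected counts $n(1-x)^2$, $2nx(1-x)$, and $nx^2$.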

2.9     Polygenes 

Many traits in a population show more variation and cannot easily be
categorized into distinct classes (Klug and Cummings, 1997). Traits exhibiting
continuous variation may be controlled by two or more genes. Such traits are said to
exhibit continuous or quantitative variation and are examples of polygenic inheritance.
The hypothesis that a large number of factors or genes are responsible
for a continuous phenotype is called the multiple-factor or multiple-gene hypothesis.
These genes have a special name, polygenes (Griffiths et al., 1993). Klug and
Cummings (1997, p. 95) summarized some major characteristics of the multiple-factor
hypothesis:

1. Characters controlled by multiple factors can usually be quantified by measuring,
weighing, counting, etc.

2.  Two  or  more  pairs  of  genes,  located  throughout  the  genome,  account  for  the 
hereditary  influence  on  the  phenotype  in  an  additive  way.  Because  many  genes 
may  be  involved,  inheritance  of  this  type  is  often  called  polygenic. 

3. Each gene locus may be occupied by either an additive allele, which contributes
a set amount to the phenotype, or by a nonadditive allele, which does not
contribute quantitatively to the phenotype.

4. The total effect of each additive allele at each locus, while small, is approximately
equivalent to all other additive alleles at other gene sites.

5.  Together,  the  genes  controlling  a  single  character  produce  substantial  pheno- 
typic  variation. 

6.  Analysis  of  polygenic  traits  requires  the  study  of  large  numbers  of  progeny  from 
a  population  of  organisms. 

If there are more than two genes controlling the phenotype, then the number of
genes involved and the effects of those genes are important to know. To address this,
Tan and Chang (1972) proposed a method for estimating the number of genes for
self-fertilized populations, assuming that there were only two alleles for each gene and
that the effect of each gene is the same. This work was further expanded by Tan and
D'Angelo (1979) to estimate the numbers and effects of major genes and polygenes,
assuming that all major genes have the same effect and all polygenes have the same
effect. Because their work was developed for self-fertilized populations, we will not
cover the details.

2.10     Two-stage  Genome  Search 

Using the DNA-level genetic marker technology proposed by Botstein et al.
(1980), Lander and Botstein (1986) concluded that the affected-sib-pair method can
be used for a genome search for disease genes. A two-stage design, with first a screening
search to eliminate nonviable marker loci and then an intensive search to identify the
gene location, is an intuitive design. Although this design has already been used in
genome-wide searches, such as those by Davies et al. (1994) and Luo et al. (1995),
the optimality of their designs was not discussed. Thus, there were no guidelines on how to
space the markers or how to allocate the available ASPs in each stage. While Elston
(1992, 1994) studied the "optimal" two-stage design and concluded that two-stage
designs are more efficient than one-stage designs, he did not consider the statistical
complexity of the multiple tests, the interval-search nature of
genome-wide linkage detection, or the resource constraints. In another paper, Brown
et al. (1994) studied, by simulation, the multiple-stage approach for genome search using the
affected-pedigree-member method. Since Brown et al. used only simulation and considered only
one pedigree, it is not obvious how to apply their results to the
ASP method. Darvasi and Soller (1994) studied the optimal spacing of genetic
markers for QTL traits without considering the two-stage approach. Holmans
and Craddock (1997) conducted a simulation study on efficient strategies for
genome scanning using maximum-likelihood affected-sib-pair analysis. The situations
they considered were: a sample of 200 affected sib-pairs, four different sample allocation
strategies, and five grid-tightening strategies. The risk ratio of sibs was equal to 2 or 3,
each marker locus had five equifrequent alleles, and there were five possible gene
locations. Since their studies were simulation studies, it is not clear how to generalize
their results. Furthermore, they did not consider different sample sizes, nor optimal
strategies under resource constraints.

The main goal of this dissertation is to answer the two-stage design questions as
they apply to rare autosomal recessive and dominant diseases with a dichotomous
phenotype (affected or not affected).


CHAPTER  3 
TWO-STAGE  GENOME  SEARCH  FOR  SIMPLE  MENDELIAN  DISEASE 


In a two-stage search method, the first stage uses part of the ASPs with widely
spaced markers across the genome. The rest of the resources are used on the
promising markers identified in the first stage. In this study, only the least favorable
configuration, in which the gene lies midway between two adjacent markers, will be used
for constructing the designs.

The three major statistical problems involved in deriving analytic solutions
are summarized as follows:

1.  Because  all  the  markers  on  the  same  chromosome  are  linked,  the  IBD  scores 
are  not  independent.  In  the  case  of  a  genome  search,  there  are  many  markers 
spread  along  the  chromosomes.  We  have  to  handle  a  high-dimensional  joint 
distribution. 

2.  In  the  first  stage,  a  number  of  loci  will  be  chosen  for  the  more  intensive  linkage 
analysis  in  the  second  stage.  The  number  and  position  of  these  loci  that  pass 
to  the  second  stage  are  random.  This  makes  calculation  of  exact  distribution 
in  the  second  stage  extremely  difficult. 

3. Since in the first stage only a small number of ASPs and a large number of
markers are used, there will be many ties in the IBD scores. Asymptotic
solutions using a continuous normal distribution cannot handle these ties.

In this dissertation, Problem 3 was handled by a multinomial distribution. We used
an independence model as an approximation for Problems 1 and 2, and checked the results
with simulation.

3.1     Assumptions 

1.  There  is  one  and  only  one  disease  gene  that  increases  the  probability  of  an 
individual  being  affected.  However,  the  disease  may  have  nongenetic  causes. 

2. Highly polymorphic, equally spaced markers are available. When $m$ markers are
assigned in the first stage, their positions are $\frac{L}{2m}, \frac{3L}{2m}, \ldots, \frac{(2m-1)L}{2m}$, where $L$, the
length of the genome, is equal to 3300 cM (Lewin, 1990).

3. For a rare autosomal recessive disease, the parents' disease genotypes are $D_1 d$
and $D_2 d$; and for a rare dominant disease, the parents' disease genotypes are
$D_1 d$ and $D_2 D_3$, where $d$ is the disease gene.

4.  In  the  same  family,  the  probability  that  the  disease  of  one  affected  sib  is  caused 
by  a  gene  and  the  other  is  not  is  negligible. 

5. The cost of typing alleles is a constant, i.e., the cost of typing $k$ markers from
one person is the same as typing one marker from $k$ persons. This assumption
of cost ratio can be relaxed to suit practical situations, but the numerical results
would need to be recalculated.

6. We use the Two-alleles statistic (Day and Simons, 1976), i.e., let $X_{ij} = 1$ if the $i$th
marker IBD score = 2 for the $j$th sib-pair, and $X_{ij} = 0$ otherwise. In our analytic
approach, the $X_{ij}$ are assumed to be independent random variables, except the
two $X_{ij}$ adjacent to the gene.

3.2     Two-Stage Procedure

Suppose there are $n$ ASPs and there are enough resources to type $N$ marker loci.
Three numbers need to be determined in a two-stage design: $n_1$ and $m$, the number
of ASPs and the number of markers to be used in the first stage (Stage I), and $r$, the
number of markers to be used in the second stage (Stage II). The markers chosen for
the second stage are based on the statistic used by Day and Simons,

$${}_1S_i = \sum_{j=1}^{n_1} X_{ij}, \tag{3.1}$$

in the first stage, where $X_{ij}$ is defined in Assumption 6. Ideally, the $r$ markers with
the highest scores are chosen for Stage II. However, in the event of ties, more than $r$
of them may have to be chosen. Thus, $R$, the actual number of markers used in Stage
II, is a random variable. The formal definition of $R$ is: $R \ge r$, but if the marker(s)
with the lowest score in this group (markers for the Stage II study) is (are) taken away,
then the total number of remaining markers is smaller than $r$.

In Stage II, $R$ markers on $N_2$ ASPs are to be typed, where $N_2$ is the largest
number of sib-pairs that can be used subject to the resource constraint. Since $R$ is a
random variable, $N_2$ is also a random variable. Without loss of generality, let the $R$
markers be $0, 1, \ldots, R-1$, and the sib-pairs in Stage II be $n_1+1, n_1+2, \ldots, n_1+N_2$.
Thus, $N_2$ is the largest $x$ such that $m n_1 + R x \le N$. Define

$${}_2S_i = \sum_{j=n_1+1}^{n_1+N_2} X_{ij}. \tag{3.2}$$

Then the marker with the uniquely highest ${}_2S_i$ is claimed to be the locus nearest the
disease gene. If two adjacent markers share the highest score, then the gene is
claimed to lie between them, and the two markers are called a marker group;
otherwise, the gene location is undetermined.
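
The selection rule and the Stage II sample size can be sketched as below. The function names and the example scores are hypothetical; the code only restates the definitions of $R$ and $N_2$ given above and does not model the genetics.

```python
def select_stage2_markers(stage1_scores, r):
    """Indices of markers passing Stage I: the top r scores plus any ties at the cutoff."""
    if not 1 <= r <= len(stage1_scores):
        raise ValueError("r must be between 1 and the number of markers")
    cutoff = sorted(stage1_scores, reverse=True)[r - 1]   # r-th largest score
    return [i for i, s in enumerate(stage1_scores) if s >= cutoff]

def stage2_sample_size(N, m, n1, R):
    """Largest x such that m*n1 + R*x <= N."""
    remaining = N - m * n1
    if remaining < 0:
        raise ValueError("Stage I already exceeds the resource budget N")
    return remaining // R

# Illustrative Stage I scores for m = 6 markers, with r = 2
passing = select_stage2_markers([3, 1, 2, 2, 0, 2], 2)    # [0, 2, 3, 5]
R = len(passing)                                          # 4, larger than r because of ties
N2 = stage2_sample_size(100, 6, 10, R)                    # 10
```

In the example, three markers tie at the cutoff score, so $R = 4$ although $r = 2$, and the remaining budget is divided among four markers instead of two.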

Once the location, $I$, has been chosen, say $I = i$, the next step is to check whether
we can claim linkage. Let $\xi_{\alpha, n_2, R}$ be the $100(1 - \alpha)$ percentile of the unique maximum
of $R$ binomial $(n_2, 0.25)$ random variables. If ${}_2S_I > \xi_{\alpha, n_2, R}$, then we claim there is
linkage at locus $i$ at significance level $\alpha$.
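
A rough sketch of such a critical value follows. It is a simplification: it uses the ordinary maximum of $R$ independent binomial $(n_2, 0.25)$ variables and ignores the uniqueness condition in the text, so it should be read as an approximation under that stated assumption. The function names are hypothetical.

```python
import math

def binom_cdf(t, n, p):
    """P(X <= t) for X ~ Binomial(n, p)."""
    return sum(math.comb(n, j) * p ** j * (1 - p) ** (n - j) for j in range(t + 1))

def max_binomial_percentile(alpha, n2, R, p=0.25):
    """Smallest c with P(max of R iid Binomial(n2, p) <= c) >= 1 - alpha.

    Assumes independence and does not require the maximum to be unique.
    """
    for c in range(n2 + 1):
        if binom_cdf(c, n2, p) ** R >= 1.0 - alpha:
            return c
    return n2
```

Linkage would then be claimed only when the highest Stage II score exceeds the returned value.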

3.3     Probability  of  Allocating  the  Correct  Marker 

3.3.1     Analytic  Approach 

To make the analytic solution tractable, Assumption 6 is made on the distributions
of the $X_{ij}$. The analytic results are later compared with a more realistic situation by simulation.

Let the markers be numbered from 0 to $m - 1$. Without loss of generality, if the gene is
at the end of the chromosome, we let the nearest marker be marker 0; and if the gene
is between two markers, we let it be located midway between markers 0 and 1. If we
assume that the gene location is uniformly distributed along the genome, the probability
that the gene is at the end is $1/m$.

Let ${}_1S_{(r)}$ be the $r$th largest statistic in the set $\{{}_1S_i\}$ in the first stage. The
probability of finding the gene by the two-stage method is

$$\begin{aligned}
&P(\text{The marker closest to the gene is found}) \\
&\quad= P(\text{Gene is at the end of the chromosome, marker 0 is chosen in Stage II}) && (3.3) \\
&\qquad+ P(\text{Gene is not at the end, either marker 0 or 1 is chosen in Stage II}). && (3.4)
\end{aligned}$$

Eq. (3.3) can be written as a sum of the probabilities of three exclusive events,
Eq. (3.5) through Eq. (3.7):

$$\begin{aligned}
&\frac{1}{m}\, P(\text{Marker 0 is chosen at the end of Stage II} \mid \text{gene is at the end}) \\
&\quad= \frac{1}{m}\, P({}_1S_0 \text{ passes Stage I and } {}_2S_0 \text{ is the highest in Stage II}) \\
&\quad= \frac{1}{m} \sum_{k=0}^{n_1} P({}_1S_0 \text{ passes Stage I, } {}_2S_0 \text{ is the highest in Stage II, } {}_1S_{(r-1)} = k) \\
&\quad= \frac{1}{m} \sum_{k=0}^{n_1} \Big\{ P({}_1S_0 \ge {}_1S_{(r-1)},\ {}_1S_{(r-1)} = k,\ {}_2S_0 \text{ is the highest in Stage II}) && (3.5) \\
&\qquad\qquad + P(k = {}_1S_{(r-1)} > {}_1S_0 > {}_1S_{(r)},\ {}_2S_0 \text{ is the highest in Stage II}) && (3.6) \\
&\qquad\qquad + P(k = {}_1S_{(r-1)} > {}_1S_0 = {}_1S_{(r)},\ {}_2S_0 \text{ is the highest in Stage II}) \Big\}. && (3.7)
\end{aligned}$$


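Eq. (3.5) is equal to (assuming independence)

$$\begin{aligned}
&P({}_1S_0 \ge {}_1S_{(r-1)},\ {}_1S_{(r-1)} = k,\ {}_2S_0 \text{ is the highest in Stage II}) \\
&\quad= \sum_{\substack{i,\,l:\ i + l \ge r-1 \\ i < r-1}} P\big({}_1S_0 \ge k,\ i\ {}_1S_j\text{'s} > k,\ l\ {}_1S_j\text{'s} = k,\ (m-1-i-l)\ {}_1S_j\text{'s} < k,\ {}_2S_0 > \max_{t \ne 0} {}_2S_t\big) \\
&\quad= \sum_{i=0}^{r-2} \sum_{l=r-1-i}^{m-1-i} \Big\{ P({}_1S_0 \ge k)\, \frac{(m-1)!}{i!\, l!\, (m-1-i-l)!} \\
&\qquad\times P({}_1S_j > k)^{i}\, P({}_1S_j = k)^{l}\, P({}_1S_j < k)^{m-1-i-l} \\
&\qquad\times \sum_{t=1}^{N_2} P({}_2S_0 = t)\, P({}_2S_j < t)^{i+l} \Big\},
\end{aligned}$$

where $N_2 = \left\lfloor \dfrac{N - n_1 m}{i + l + 1} \right\rfloor_{\mathrm{Int}}$.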
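
The inner Stage II sum that recurs in these expressions, $\sum_{t=1}^{N_2} P({}_2S_0 = t)\, P({}_2S_j < t)^{c}$, can be evaluated numerically as sketched below. The sketch assumes independent binomial scores, as in the analytic approach; the parameter `p0`, the probability that $X_{ij} = 1$ at the marker nearest the gene, is a hypothetical input and is not derived here from the recombination fraction.

```python
import math

def binom_pmf(t, n, p):
    """P(X = t) for X ~ Binomial(n, p)."""
    return math.comb(n, t) * p ** t * (1 - p) ** (n - t)

def stage2_inner_sum(N2, p0, n_competitors, p_null=0.25):
    """Sum over t of P(S0 = t) * P(Sj < t)**n_competitors.

    S0 ~ Binomial(N2, p0) is the score of the marker nearest the gene;
    each competitor Sj ~ Binomial(N2, p_null), independent (an assumption).
    """
    total = 0.0
    below = 0.0                                   # running value of P(Sj < t)
    for t in range(1, N2 + 1):
        below += binom_pmf(t - 1, N2, p_null)
        total += binom_pmf(t, N2, p0) * below ** n_competitors
    return total
```

For Eq. (3.5), `n_competitors` would be $i + l$ and `N2` would be the integer part of $(N - n_1 m)/(i + l + 1)$; the value decreases as more competing markers enter Stage II.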


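Eq. (3.6) is equal to

$$\begin{aligned}
&P({}_1S_{(r-1)} = k > {}_1S_0 > {}_1S_{(r)},\ \text{gene is found}) \\
&\quad= \sum_{i=0}^{k-2} P({}_1S_{(r-1)} = k > {}_1S_0 > {}_1S_{(r)} = i,\ {}_2S_0 \text{ is uniquely highest}) \\
&\quad= \sum_{i=0}^{k-2} \sum_{l=1}^{r-1} \sum_{h=1}^{(m-1)-(r-1)} P\big(k = {}_1S_{(r-1)} > {}_1S_0 > {}_1S_{(r)} = i,\ (r-1-l)\ {}_1S_j\text{'s} > k, \\
&\qquad\qquad l\ {}_1S_j\text{'s} = k,\ h\ {}_1S_j\text{'s} = i,\ (m-r-h)\ {}_1S_j\text{'s} < i,\ {}_2S_0 \text{ is uniquely highest}\big) \\
&\quad= \sum_{i=0}^{k-2} \sum_{l=1}^{r-1} \sum_{h=1}^{(m-1)-(r-1)} \Big\{ P(k > {}_1S_0 > i)\, \frac{(m-1)!}{(r-1-l)!\, l!\, h!\, ((m-1)-(r-1)-h)!} \\
&\qquad\times P({}_1S_j > k)^{r-1-l}\, P({}_1S_j = k)^{l}\, P({}_1S_j = i)^{h}\, P({}_1S_j < i)^{(m-1)-(r-1)-h} \\
&\qquad\times \sum_{t=1}^{N_2} P({}_2S_0 = t)\, P({}_2S_j < t)^{r-1} \Big\},
\end{aligned}$$

where $N_2 = \left\lfloor \dfrac{N - n_1 m}{r} \right\rfloor_{\mathrm{Int}}$.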


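Eq. (3.7) is equal to

$$\begin{aligned}
&P({}_1S_{(r-1)} = k > {}_1S_0 = {}_1S_{(r)},\ \text{gene is found}) \\
&\quad= \sum_{i=0}^{k-1} P({}_1S_{(r-1)} = k > {}_1S_0 = {}_1S_{(r)} = i,\ {}_2S_0 \text{ is uniquely highest}) \\
&\quad= \sum_{i=0}^{k-1} \sum_{l=1}^{r-1} \sum_{h=1}^{(m-1)-(r-1)} P\big(k = {}_1S_{(r-1)} > {}_1S_0 = {}_1S_{(r)} = i,\ (r-1-l)\ {}_1S_j\text{'s} > k, \\
&\qquad\qquad l\ {}_1S_j\text{'s} = k,\ h\ {}_1S_j\text{'s} = i,\ (m-r-h)\ {}_1S_j\text{'s} < i,\ {}_2S_0 \text{ is uniquely highest}\big) \\
&\quad= \sum_{i=0}^{k-1} \sum_{l=1}^{r-1} \sum_{h=1}^{(m-1)-(r-1)} \Big\{ P({}_1S_0 = i)\, \frac{(m-1)!}{(r-1-l)!\, l!\, h!\, ((m-1)-(r-1)-h)!} \\
&\qquad\times P({}_1S_j > k)^{r-1-l}\, P({}_1S_j = k)^{l}\, P({}_1S_j = i)^{h}\, P({}_1S_j < i)^{(m-1)-(r-1)-h} \\
&\qquad\times \sum_{t=1}^{N_2} P({}_2S_0 = t)\, P({}_2S_j < t)^{r-1+h} \Big\},
\end{aligned}$$

where $N_2 = \left\lfloor \dfrac{N - n_1 m}{r + h} \right\rfloor_{\mathrm{Int}}$.

Eq. (3.4) can be written as

$$\begin{aligned}
&\frac{m-1}{m}\, P(\text{Marker 0 or 1 is chosen at the end of Stage II} \mid \text{gene is not at the end}) \\
&\quad= \frac{m-1}{m} \Big\{ P({}_1S_0 \text{ passes Stage I and } {}_1S_1 \text{ does not, } {}_2S_0 \text{ is the highest in Stage II}) && (3.8) \\
&\qquad\quad + P({}_1S_1 \text{ passes Stage I and } {}_1S_0 \text{ does not, } {}_2S_1 \text{ is the highest in Stage II}) && (3.9) \\
&\qquad\quad + P({}_1S_0 \text{ and } {}_1S_1 \text{ pass Stage I and } {}_2S_0 \text{ or } {}_2S_1 \text{ is the highest in Stage II}) \Big\}. && (3.10)
\end{aligned}$$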

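Since the gene is assumed to be midway between two markers, Eq. (3.8) is equal
to Eq. (3.9), and both are equal to

$$\begin{aligned}
&P({}_1S_0 \text{ passes Stage I and } {}_1S_1 \text{ does not, } {}_2S_0 \text{ is the highest in Stage II}) \\
&\quad= \sum_{k=1}^{n_1} P({}_1S_0 \text{ passes Stage I and } {}_1S_1 \text{ does not, } {}_2S_0 \text{ is the highest in Stage II, } {}_1S_{(r-1)} = k) \\
&\quad= \sum_{k=1}^{n_1} \Big\{ P({}_1S_0 \ge {}_1S_{(r-1)} = k > {}_1S_1,\ {}_2S_0 \text{ is the highest in Stage II}) && (3.11) \\
&\qquad\quad + P({}_1S_{(r-1)} = k > {}_1S_0 > {}_1S_{(r)},\ {}_1S_0 > {}_1S_1,\ {}_2S_0 \text{ is the highest in Stage II}) && (3.12) \\
&\qquad\quad + P({}_1S_{(r-1)} = k > {}_1S_0 = {}_1S_{(r)},\ {}_1S_0 > {}_1S_1,\ {}_2S_0 \text{ is the highest in Stage II}) \Big\}. && (3.13)
\end{aligned}$$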


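Eq. (3.11) is equal to

$$\begin{aligned}
&P({}_1S_0 \ge {}_1S_{(r-1)} = k > {}_1S_1,\ {}_2S_0 \text{ is the highest in Stage II}) \\
&\quad= \sum_{\substack{i,\,l:\ i + l \ge r-1 \\ i < r-1}} P\big({}_1S_0 \ge k,\ i\ {}_1S_j\text{'s} > k,\ l\ {}_1S_j\text{'s} = k, \\
&\qquad\qquad (m-2-i-l)\ {}_1S_j\text{'s} < k,\ {}_1S_1 < k,\ {}_2S_0 > \max_{t \ne 0} {}_2S_t\big) \\
&\quad= \sum_{i=0}^{r-2} \sum_{l=r-1-i}^{m-2-i} \Big\{ P({}_1S_0 \ge k,\ {}_1S_1 < k)\, \frac{(m-2)!}{i!\, l!\, (m-2-i-l)!} \\
&\qquad\times P({}_1S_j > k)^{i}\, P({}_1S_j = k)^{l}\, P({}_1S_j < k)^{m-2-i-l} \\
&\qquad\times \sum_{t=1}^{N_2} P({}_2S_0 = t)\, P({}_2S_j < t)^{i+l} \Big\},
\end{aligned}$$

where $N_2 = \left\lfloor \dfrac{N - n_1 m}{i + l + 1} \right\rfloor_{\mathrm{Int}}$.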
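
Eq. (3.12) is equal to

$$\begin{aligned}
&\sum_{i=0}^{k-2} P(k = {}_1S_{(r-1)} > {}_1S_0 > {}_1S_{(r)} = i,\ {}_1S_0 > {}_1S_1,\ {}_2S_0 \text{ is the highest in Stage II}) \\
&\quad= \sum_{i=0}^{k-2} \sum_{l=1}^{r-1} \sum_{h=1}^{(m-2)-(r-1)} P\big(k = {}_1S_{(r-1)} > {}_1S_0 > {}_1S_{(r)} = i,\ {}_1S_0 > {}_1S_1, \\
&\qquad\qquad (r-1-l)\ {}_1S_j\text{'s} > k,\ l\ {}_1S_j\text{'s} = k,\ h\ {}_1S_j\text{'s} = i,\ (m-r-1-h)\ {}_1S_j\text{'s} < i, \\
&\qquad\qquad {}_2S_0 > \max_{t \ne 0,1} {}_2S_t\big) \\
&\quad= \sum_{i=0}^{k-2} \sum_{l=1}^{r-1} \sum_{h=1}^{(m-2)-(r-1)} \Big\{ P(k > {}_1S_0 > i,\ {}_1S_0 > {}_1S_1)\, \frac{(m-2)!}{(r-1-l)!\, l!\, h!\, ((m-2)-(r-1)-h)!} \\
&\qquad\times P({}_1S_j > k)^{r-1-l}\, P({}_1S_j = k)^{l}\, P({}_1S_j = i)^{h}\, P({}_1S_j < i)^{m-r-1-h} \\
&\qquad\times \sum_{t=1}^{N_2} P({}_2S_0 = t)\, P({}_2S_j < t)^{r-1} \Big\},
\end{aligned}$$

where $N_2 = \left\lfloor \dfrac{N - n_1 m}{r} \right\rfloor_{\mathrm{Int}}$.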


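Eq. (3.13) is equal to

$$\begin{aligned}
&\sum_{i=1}^{k-2} P(k = {}_1S_{(r-1)} > {}_1S_0 = {}_1S_{(r)} = i,\ {}_1S_0 > {}_1S_1,\ {}_2S_0 \text{ is the highest in Stage II}) \\
&\quad= \sum_{i=1}^{k-2} \sum_{l=1}^{r-1} \sum_{h=1}^{(m-2)-(r-1)} P\big(k = {}_1S_{(r-1)} > {}_1S_0 = {}_1S_{(r)} = i,\ {}_1S_0 > {}_1S_1, \\
&\qquad\qquad (r-1-l)\ {}_1S_j\text{'s} > k,\ l\ {}_1S_j\text{'s} = k,\ h\ {}_1S_j\text{'s} = i,\ (m-r-1-h)\ {}_1S_j\text{'s} < i, \\
&\qquad\qquad {}_2S_0 > \max_{t \ne 0,1} {}_2S_t\big) \\
&\quad= \sum_{i=1}^{k-2} \sum_{l=1}^{r-1} \sum_{h=1}^{(m-2)-(r-1)} \Big\{ P({}_1S_0 = i,\ {}_1S_1 < i)\, \frac{(m-2)!}{(r-1-l)!\, l!\, h!\, ((m-2)-(r-1)-h)!} \\
&\qquad\times P({}_1S_j > k)^{r-1-l}\, P({}_1S_j = k)^{l}\, P({}_1S_j = i)^{h}\, P({}_1S_j < i)^{m-r-1-h} \\
&\qquad\times \sum_{t=1}^{N_2} P({}_2S_0 = t)\, P({}_2S_j < t)^{r-1+h} \Big\},
\end{aligned}$$

where $N_2 = \left\lfloor \dfrac{N - n_1 m}{r + h} \right\rfloor_{\mathrm{Int}}$.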


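Eq. (3.10) is equivalent to

$$\begin{aligned}
&P({}_1S_0 \text{ and } {}_1S_1 \text{ pass Stage I, gene is found in Stage II}) \\
&= \sum_{k} \Big\{ P({}_1S_0 \ge {}_1S_{(r-2)} = k,\ {}_1S_1 \ge {}_1S_{(r-2)} = k,\ \text{gene is found in Stage II}) && (3.14) \\
&\quad + P({}_1S_0 \ge {}_1S_{(r-2)} = k > {}_1S_1 > {}_1S_{(r-1)},\ \text{gene is found in Stage II}) && (3.15) \\
&\quad + P({}_1S_0 \ge {}_1S_{(r-2)} = k > {}_1S_1 = {}_1S_{(r-1)},\ \text{gene is found in Stage II}) && (3.16) \\
&\quad + P({}_1S_1 \ge {}_1S_{(r-2)} = k > {}_1S_0 > {}_1S_{(r-1)},\ \text{gene is found in Stage II}) && (3.17) \\
&\quad + P({}_1S_1 \ge {}_1S_{(r-2)} = k > {}_1S_0 = {}_1S_{(r-1)},\ \text{gene is found in Stage II}) && (3.18) \\
&\quad + P({}_1S_{(r-2)} = k > {}_1S_0 > {}_1S_1 > {}_1S_{(r-1)},\ \text{gene is found in Stage II}) && (3.19) \\
&\quad + P({}_1S_{(r-2)} = k > {}_1S_0 > {}_1S_1 = {}_1S_{(r-1)},\ \text{gene is found in Stage II}) && (3.20) \\
&\quad + P({}_1S_{(r-2)} = k > {}_1S_1 > {}_1S_0 > {}_1S_{(r-1)},\ \text{gene is found in Stage II}) && (3.21) \\
&\quad + P({}_1S_{(r-2)} = k > {}_1S_1 > {}_1S_0 = {}_1S_{(r-1)},\ \text{gene is found in Stage II}) && (3.22) \\
&\quad + P({}_1S_{(r-2)} = k > {}_1S_1 = {}_1S_0 > {}_1S_{(r-1)},\ \text{gene is found in Stage II}) && (3.23) \\
&\quad + P({}_1S_{(r-2)} = k > {}_1S_1 = {}_1S_0 = {}_1S_{(r-1)},\ \text{gene is found in Stage II}) && (3.24) \\
&\quad + P({}_1S_{(r-1)} = k > {}_1S_1 = {}_1S_0 > {}_1S_{(r)},\ \text{gene is found in Stage II}) && (3.25) \\
&\quad + P({}_1S_{(r-1)} = k > {}_1S_1 = {}_1S_0 = {}_1S_{(r)},\ \text{gene is found in Stage II}) \Big\}. && (3.26)
\end{aligned}$$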


Again,  because  the  gene  is  in  the  middle  of  two  markers,  Eq.  (3.15)  is  equal  to 
Eq.  (3.17),  (3.16)  is  equal  to  (3.18),  (3.19)  is  equal  to  (3.21),  and  (3.20)  is  equal  to 
(3.22). 
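
Eq. (3.14) is equal to

$$\begin{aligned}
&P({}_1S_0 \ge {}_1S_{(r-2)} = k,\ {}_1S_1 \ge {}_1S_{(r-2)} = k,\ {}_2S_0 \text{ is uniquely highest in Stage II}) \\
&\quad+ P({}_1S_0 \ge {}_1S_{(r-2)} = k,\ {}_1S_1 \ge {}_1S_{(r-2)} = k,\ {}_2S_1 \text{ is uniquely highest in Stage II}) \\
&\quad+ P({}_1S_0 \ge {}_1S_{(r-2)} = k,\ {}_1S_1 \ge {}_1S_{(r-2)} = k,\ {}_2S_0 = {}_2S_1 \text{ is the highest}) \\
&= \underbrace{P({}_1S_0 \ge {}_1S_{(r-2)} = k,\ {}_1S_1 \ge {}_1S_{(r-2)} = k,\ {}_2S_0 \text{ or } {}_2S_1 \text{ is uniquely highest in Stage II})}_{(A)} \\
&\quad+ \underbrace{P({}_1S_0 \ge {}_1S_{(r-2)} = k,\ {}_1S_1 \ge {}_1S_{(r-2)} = k,\ {}_2S_0 = {}_2S_1 \text{ is the highest})}_{(B)}.
\end{aligned}$$

Term (A) is

$$\begin{aligned}
(A) &= \sum_{\substack{i,\,l:\ i + l \ge r-2 \\ i < r-2}} P\big({}_1S_0 \ge k,\ {}_1S_1 \ge k,\ {}_1S_{(r-2)} = k,\ i\ {}_1S_j\text{'s} > k,\ l\ {}_1S_j\text{'s} = k, \\
&\qquad\qquad (m-2-i-l)\ {}_1S_j\text{'s} < k,\ {}_2S_0 \text{ or } {}_2S_1 \text{ is uniquely highest in Stage II}\big) \\
&= \sum_{i=0}^{r-3} \sum_{l=r-2-i}^{m-2-i} \Big\{ P({}_1S_0 \ge k,\ {}_1S_1 \ge k)\, \frac{(m-2)!}{i!\, l!\, (m-2-i-l)!}\, P({}_1S_j > k)^{i}\, P({}_1S_j = k)^{l}\, P({}_1S_j < k)^{m-2-i-l} \\
&\qquad\times \Big[ P\big({}_2S_0 > \max_{t \ne 0,1} {}_2S_t,\ {}_2S_0 > {}_2S_1\big) + P\big({}_2S_1 > \max_{t \ne 0,1} {}_2S_t,\ {}_2S_1 > {}_2S_0\big) \Big] \Big\} \\
&= \sum_{i=0}^{r-3} \sum_{l=r-2-i}^{m-2-i} \Big\{ P({}_1S_0 \ge k,\ {}_1S_1 \ge k)\, \frac{(m-2)!}{i!\, l!\, (m-2-i-l)!}\, P({}_1S_j > k)^{i}\, P({}_1S_j = k)^{l}\, P({}_1S_j < k)^{m-2-i-l} \\
&\qquad\times 2 \sum_{t=1}^{N_2} P({}_2S_0 = t,\ {}_2S_1 < t)\, P({}_2S_j < t)^{i+l} \Big\},
\end{aligned}$$

where $N_2 = \left\lfloor \dfrac{N - n_1 m}{i + l + 2} \right\rfloor_{\mathrm{Int}}$.

Term (B) is

$$\begin{aligned}
(B) &= \sum_{\substack{i,\,l:\ i + l \ge r-2 \\ i < r-2}} P\big({}_1S_0 \ge k,\ {}_1S_1 \ge k,\ {}_1S_{(r-2)} = k,\ i\ {}_1S_j\text{'s} > k,\ l\ {}_1S_j\text{'s} = k, \\
&\qquad\qquad (m-2-i-l)\ {}_1S_j\text{'s} < k,\ {}_2S_0 = {}_2S_1 \text{ is the highest in Stage II}\big) \\
&= \sum_{i=0}^{r-3} \sum_{l=r-2-i}^{m-2-i} \Big\{ P({}_1S_0 \ge k,\ {}_1S_1 \ge k)\, \frac{(m-2)!}{i!\, l!\, (m-2-i-l)!}\, P({}_1S_j > k)^{i}\, P({}_1S_j = k)^{l}\, P({}_1S_j < k)^{m-2-i-l} \\
&\qquad\times P\big({}_2S_0 = {}_2S_1 > \max_{t \ne 0,1} {}_2S_t\big) \Big\} \\
&= \sum_{i=0}^{r-3} \sum_{l=r-2-i}^{m-2-i} \Big\{ P({}_1S_0 \ge k,\ {}_1S_1 \ge k)\, \frac{(m-2)!}{i!\, l!\, (m-2-i-l)!}\, P({}_1S_j > k)^{i}\, P({}_1S_j = k)^{l}\, P({}_1S_j < k)^{m-2-i-l} \\
&\qquad\times \sum_{t=1}^{N_2} P({}_2S_0 = t,\ {}_2S_1 = t)\, P({}_2S_j < t)^{i+l} \Big\},
\end{aligned}$$

where $N_2 = \left\lfloor \dfrac{N - n_1 m}{i + l + 2} \right\rfloor_{\mathrm{Int}}$.

Therefore,

$$\begin{aligned}
(A) + (B) &= \sum_{i=0}^{r-3} \sum_{l=r-2-i}^{m-2-i} \Big\{ P({}_1S_0 \ge k,\ {}_1S_1 \ge k)\, \frac{(m-2)!}{i!\, l!\, (m-2-i-l)!}\, P({}_1S_j > k)^{i}\, P({}_1S_j = k)^{l}\, P({}_1S_j < k)^{m-2-i-l} \\
&\qquad\times \sum_{t=1}^{N_2} \big[ 2P({}_2S_0 = t,\ {}_2S_1 < t) + P({}_2S_0 = t,\ {}_2S_1 = t) \big]\, P({}_2S_j < t)^{i+l} \Big\},
\end{aligned}$$

where $N_2 = \left\lfloor \dfrac{N - n_1 m}{i + l + 2} \right\rfloor_{\mathrm{Int}}$.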


Eq. (3.15) and (3.17) are equal to

$$\begin{aligned}
&P({}_1S_0 \ge {}_1S_{(r-2)} = k > {}_1S_1 > {}_1S_{(r-1)},\ \text{gene is found}) \\
&\quad= \sum_{i=0}^{k-2} \Big\{ P({}_1S_0 \ge {}_1S_{(r-2)} = k > {}_1S_1 > {}_1S_{(r-1)} = i,\ {}_2S_0 \text{ or } {}_2S_1 \text{ is uniquely highest in Stage II}) \\
&\qquad\quad + P({}_1S_0 \ge {}_1S_{(r-2)} = k > {}_1S_1 > {}_1S_{(r-1)} = i,\ {}_2S_0 = {}_2S_1 \text{ is the highest in Stage II}) \Big\} \\
&\quad= \sum_{i=0}^{k-2} \sum_{l=1}^{r-2} \sum_{h=1}^{(m-2)-(r-2)} \Big\{ P({}_1S_0 \ge k,\ k > {}_1S_1 > i)\, \frac{(m-2)!}{(r-2-l)!\, l!\, h!\, ((m-2)-(r-2)-h)!} \\
&\qquad\times P({}_1S_j > k)^{r-2-l}\, P({}_1S_j = k)^{l}\, P({}_1S_j = i)^{h}\, P({}_1S_j < i)^{(m-2)-(r-2)-h} \\
&\qquad\times \sum_{t=1}^{N_2} \big( 2P({}_2S_0 = t,\ {}_2S_1 < t) + P({}_2S_0 = t,\ {}_2S_1 = t) \big)\, P({}_2S_j < t)^{r-2} \Big\},
\end{aligned}$$

where $N_2 = \left\lfloor \dfrac{N - n_1 m}{r} \right\rfloor_{\mathrm{Int}}$.


Eq. (3.16) and (3.18) are equal to

$$\begin{aligned}
&P({}_1S_0 \ge {}_1S_{(r-2)} = k > {}_1S_1 = {}_1S_{(r-1)},\ \text{gene is found}) \\
&\quad= \sum_{i=0}^{k-1} \Big\{ P({}_1S_0 \ge {}_1S_{(r-2)} = k > {}_1S_1 = {}_1S_{(r-1)} = i,\ {}_2S_0 \text{ or } {}_2S_1 \text{ is uniquely highest in Stage II}) \\
&\qquad\quad + P({}_1S_0 \ge {}_1S_{(r-2)} = k > {}_1S_1 = {}_1S_{(r-1)} = i,\ {}_2S_0 = {}_2S_1 \text{ is the highest}) \Big\} \\
&\quad= \sum_{i=0}^{k-1} \sum_{l=1}^{r-2} \sum_{h=1}^{(m-2)-(r-2)} \Big\{ P({}_1S_0 \ge k,\ {}_1S_1 = i)\, \frac{(m-2)!}{(r-2-l)!\, l!\, h!\, ((m-2)-(r-2)-h)!} \\
&\qquad\times P({}_1S_j > k)^{r-2-l}\, P({}_1S_j = k)^{l}\, P({}_1S_j = i)^{h}\, P({}_1S_j < i)^{(m-2)-(r-2)-h} \\
&\qquad\times \sum_{t=1}^{N_2} \big( 2P({}_2S_0 = t,\ {}_2S_1 < t) + P({}_2S_0 = t,\ {}_2S_1 = t) \big)\, P({}_2S_j < t)^{r-2+h} \Big\},
\end{aligned}$$

where $N_2 = \left\lfloor \dfrac{N - n_1 m}{r + h} \right\rfloor_{\mathrm{Int}}$.


Eq. (3.19) and (3.21) are equal to

$$\begin{aligned}
&P({}_1S_{(r-2)} = k > {}_1S_0 > {}_1S_1 > {}_1S_{(r-1)},\ \text{gene is found}) \\
&\quad= \sum_{i=0}^{k-3} \Big\{ P({}_1S_{(r-2)} = k > {}_1S_0 > {}_1S_1 > {}_1S_{(r-1)} = i,\ {}_2S_0 \text{ or } {}_2S_1 \text{ is uniquely highest in Stage II}) \\
&\qquad\quad + P({}_1S_{(r-2)} = k > {}_1S_0 > {}_1S_1 > {}_1S_{(r-1)} = i,\ {}_2S_0 = {}_2S_1 \text{ is the highest}) \Big\} \\
&\quad= \sum_{i=0}^{k-3} \sum_{l=1}^{r-2} \sum_{h=1}^{(m-2)-(r-2)} \Big\{ P(k > {}_1S_0 > {}_1S_1 > i)\, \frac{(m-2)!}{(r-2-l)!\, l!\, h!\, ((m-2)-(r-2)-h)!} \\
&\qquad\times P({}_1S_j > k)^{r-2-l}\, P({}_1S_j = k)^{l}\, P({}_1S_j = i)^{h}\, P({}_1S_j < i)^{(m-2)-(r-2)-h} \\
&\qquad\times \sum_{t=1}^{N_2} \big( 2P({}_2S_0 = t,\ {}_2S_1 < t) + P({}_2S_0 = t,\ {}_2S_1 = t) \big)\, P({}_2S_j < t)^{r-2} \Big\},
\end{aligned}$$

where $N_2 = \left\lfloor \dfrac{N - n_1 m}{r} \right\rfloor_{\mathrm{Int}}$.

Eq. (3.20) and (3.22) are equal to

$$\begin{aligned}
&P({}_1S_{(r-2)} = k > {}_1S_0 > {}_1S_1 = {}_1S_{(r-1)},\ \text{gene is found}) \\
&\quad= \sum_{i=0}^{k-2} \Big\{ P({}_1S_{(r-2)} = k > {}_1S_0 > {}_1S_1 = {}_1S_{(r-1)} = i,\ {}_2S_0 \text{ or } {}_2S_1 \text{ is uniquely highest in Stage II}) \\
&\qquad\quad + P({}_1S_{(r-2)} = k > {}_1S_0 > {}_1S_1 = {}_1S_{(r-1)} = i,\ {}_2S_0 = {}_2S_1 \text{ is the highest in Stage II}) \Big\} \\
&\quad= \sum_{i=0}^{k-2} \sum_{l=1}^{r-2} \sum_{h=1}^{(m-2)-(r-2)} \Big\{ P(k > {}_1S_0 > i,\ {}_1S_1 = i)\, \frac{(m-2)!}{(r-2-l)!\, l!\, h!\, ((m-2)-(r-2)-h)!} \\
&\qquad\times P({}_1S_j > k)^{r-2-l}\, P({}_1S_j = k)^{l}\, P({}_1S_j = i)^{h}\, P({}_1S_j < i)^{(m-2)-(r-2)-h} \\
&\qquad\times \sum_{t=1}^{N_2} \big( 2P({}_2S_0 = t,\ {}_2S_1 < t) + P({}_2S_0 = t,\ {}_2S_1 = t) \big)\, P({}_2S_j < t)^{r-2+h} \Big\},
\end{aligned}$$

where $N_2 = \left\lfloor \dfrac{N - n_1 m}{r + h} \right\rfloor_{\mathrm{Int}}$.


Eq. (3.23) is equal to

$$\begin{aligned}
&P({}_1S_{(r-2)} = k > {}_1S_0 = {}_1S_1 > {}_1S_{(r-1)},\ \text{gene is found}) \\
&\quad= \sum_{i=0}^{k-2} \Big\{ P({}_1S_{(r-2)} = k > {}_1S_0 = {}_1S_1 > {}_1S_{(r-1)} = i,\ {}_2S_0 \text{ or } {}_2S_1 \text{ is uniquely highest in Stage II}) \\
&\qquad\quad + P({}_1S_{(r-2)} = k > {}_1S_0 = {}_1S_1 > {}_1S_{(r-1)} = i,\ {}_2S_0 = {}_2S_1 \text{ is the highest in Stage II}) \Big\} \\
&\quad= \sum_{i=0}^{k-2} \sum_{l=1}^{r-2} \sum_{h=1}^{(m-2)-(r-2)} \Big\{ P(k > {}_1S_0 = {}_1S_1 > i)\, \frac{(m-2)!}{(r-2-l)!\, l!\, h!\, ((m-2)-(r-2)-h)!} \\
&\qquad\times P({}_1S_j > k)^{r-2-l}\, P({}_1S_j = k)^{l}\, P({}_1S_j = i)^{h}\, P({}_1S_j < i)^{(m-2)-(r-2)-h} \\
&\qquad\times \sum_{t=1}^{N_2} \big( 2P({}_2S_0 = t,\ {}_2S_1 < t) + P({}_2S_0 = t,\ {}_2S_1 = t) \big)\, P({}_2S_j < t)^{r-2} \Big\},
\end{aligned}$$

where $N_2 = \left\lfloor \dfrac{N - n_1 m}{r} \right\rfloor_{\mathrm{Int}}$.


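Eq. (3.24) is equal to

\[
\begin{aligned}
& P({}_1S_{(r-2)} = k > {}_1S_0 = {}_1S_1 = {}_1S_{(r-1)},\ \text{gene is found}) \\
={}& \sum_{i=0}^{k-1} \Bigl\{ P({}_1S_{(r-2)} = k > {}_1S_0 = {}_1S_1 = {}_1S_{(r-1)} = i,\ {}_2S_0 \text{ or } {}_2S_1 \text{ is uniquely highest in Stage II}) \\
& \qquad + P({}_1S_{(r-2)} = k > {}_1S_0 = {}_1S_1 = {}_1S_{(r-1)} = i,\ {}_2S_0 = {}_2S_1 \text{ is the highest in Stage II}) \Bigr\} \\
={}& \sum_{i=0}^{k-1} \sum_{l=1}^{r-2} \sum_{h=1}^{(m-2)-(r-2)}
\Bigl\{ P({}_1S_0 = {}_1S_1 = i,\ (r-2-l)\ {}_1S_j\text{'s} > k,\ l\ {}_1S_j\text{'s} = k, \\
& \qquad\quad h\ {}_1S_j\text{'s} = i,\ (m-r-h)\ {}_1S_j\text{'s} < i,\ {}_2S_0 \text{ or } {}_2S_1 \text{ is uniquely highest in Stage II}) \\
& \qquad + P({}_1S_0 = {}_1S_1 = i,\ (r-2-l)\ {}_1S_j\text{'s} > k,\ l\ {}_1S_j\text{'s} = k, \\
& \qquad\quad h\ {}_1S_j\text{'s} = i,\ (m-r-h)\ {}_1S_j\text{'s} < i,\ {}_2S_0 = {}_2S_1 \text{ is the highest in Stage II}) \Bigr\} \\
={}& \sum_{i=0}^{k-1} \sum_{l=1}^{r-2} \sum_{h=1}^{(m-2)-(r-2)}
P({}_1S_0 = {}_1S_1 = i)\,
\frac{(m-2)!}{(r-2-l)!\; l!\; h!\; ((m-2)-(r-2)-h)!} \\
& \times P({}_1S_j > k)^{r-2-l}\, P({}_1S_j = k)^{l}\, P({}_1S_j = i)^{h}\, P({}_1S_j < i)^{(m-2)-(r-2)-h} \\
& \times \sum_{t=1}^{N_2} \bigl(2P({}_2S_0 = t,\ {}_2S_1 < t) + P({}_2S_0 = t,\ {}_2S_1 = t)\bigr)\, P({}_2S_j < t)^{r-2+h},
\end{aligned}
\]

where $N_2 = \mathrm{Int}\!\left[\dfrac{N - n_1 m}{r + h}\right]$.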


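Eq. (3.25) is equal to

\[
\begin{aligned}
& P({}_1S_{(r-1)} = k > {}_1S_0 = {}_1S_1 > {}_1S_{(r)},\ \text{gene is found}) \\
={}& \sum_{i=0}^{k-2} \Bigl\{ P({}_1S_{(r-1)} = k > {}_1S_0 = {}_1S_1 > {}_1S_{(r)} = i,\ {}_2S_0 \text{ or } {}_2S_1 \text{ is uniquely highest in Stage II}) \\
& \qquad + P({}_1S_{(r-1)} = k > {}_1S_0 = {}_1S_1 > {}_1S_{(r)} = i,\ {}_2S_0 = {}_2S_1 \text{ is the highest in Stage II}) \Bigr\} \\
={}& \sum_{i=0}^{k-2} \sum_{l=1}^{r-1} \sum_{h=1}^{(m-2)-(r-1)}
\Bigl\{ P(k > {}_1S_0 = {}_1S_1 > i,\ (r-1-l)\ {}_1S_j\text{'s} > k,\ l\ {}_1S_j\text{'s} = k,\ h\ {}_1S_j\text{'s} = i, \\
& \qquad\quad ((m-2)-(r-1)-h)\ {}_1S_j\text{'s} < i,\ {}_2S_0 \text{ or } {}_2S_1 \text{ is uniquely highest in Stage II}) \\
& \qquad + P(k > {}_1S_0 = {}_1S_1 > i,\ (r-1-l)\ {}_1S_j\text{'s} > k,\ l\ {}_1S_j\text{'s} = k,\ h\ {}_1S_j\text{'s} = i, \\
& \qquad\quad ((m-2)-(r-1)-h)\ {}_1S_j\text{'s} < i,\ {}_2S_0 = {}_2S_1 \text{ is the highest in Stage II}) \Bigr\} \\
={}& \sum_{i=0}^{k-2} \sum_{l=1}^{r-1} \sum_{h=1}^{(m-2)-(r-1)}
P(k > {}_1S_0 = {}_1S_1 > i)\,
\frac{(m-2)!}{(r-1-l)!\; l!\; h!\; ((m-2)-(r-1)-h)!} \\
& \times P({}_1S_j > k)^{r-1-l}\, P({}_1S_j = k)^{l}\, P({}_1S_j = i)^{h}\, P({}_1S_j < i)^{(m-2)-(r-1)-h} \\
& \times \sum_{t=1}^{N_2} \bigl(2P({}_2S_0 = t,\ {}_2S_1 < t) + P({}_2S_0 = t,\ {}_2S_1 = t)\bigr)\, P({}_2S_j < t)^{r-1},
\end{aligned}
\]

where $N_2 = \mathrm{Int}\!\left[\dfrac{N - n_1 m}{r + 1}\right]$.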


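Eq. (3.26) is equal to

\[
\begin{aligned}
& P({}_1S_{(r-1)} = k > {}_1S_0 = {}_1S_1 = {}_1S_{(r)},\ \text{gene is found}) \\
={}& \sum_{i=0}^{k-1} \Bigl\{ P({}_1S_{(r-1)} = k > {}_1S_0 = {}_1S_1 = {}_1S_{(r)} = i,\ {}_2S_0 \text{ or } {}_2S_1 \text{ is uniquely highest in Stage II}) \\
& \qquad + P({}_1S_{(r-1)} = k > {}_1S_0 = {}_1S_1 = {}_1S_{(r)} = i,\ {}_2S_0 = {}_2S_1 \text{ is the highest in Stage II}) \Bigr\} \\
={}& \sum_{i=0}^{k-1} \sum_{l=1}^{r-1} \sum_{h=1}^{(m-2)-(r-1)}
\Bigl\{ P({}_1S_0 = {}_1S_1 = i,\ (r-1-l)\ {}_1S_j\text{'s} > k,\ l\ {}_1S_j\text{'s} = k, \\
& \qquad\quad h\ {}_1S_j\text{'s} = i,\ ((m-2)-(r-1)-h)\ {}_1S_j\text{'s} < i,\ {}_2S_0 \text{ or } {}_2S_1 \text{ is uniquely highest in Stage II}) \\
& \qquad + P({}_1S_0 = {}_1S_1 = i,\ (r-1-l)\ {}_1S_j\text{'s} > k,\ l\ {}_1S_j\text{'s} = k, \\
& \qquad\quad h\ {}_1S_j\text{'s} = i,\ ((m-2)-(r-1)-h)\ {}_1S_j\text{'s} < i,\ {}_2S_0 = {}_2S_1 \text{ is the highest in Stage II}) \Bigr\}
\end{aligned}
\]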
Table 3.1. The joint distribution of the IBD scores of locus 1 and locus 2.

                                     IBD of locus 2
IBD of locus 1          2                        1                              0

      2           $\Psi^2/4$            $\Psi(1-\Psi)/2$                $(1-\Psi)^2/4$
      1           $\Psi(1-\Psi)/2$      $[\Psi^2+(1-\Psi)^2]/2$         $\Psi(1-\Psi)/2$
      0           $(1-\Psi)^2/4$        $\Psi(1-\Psi)/2$                $\Psi^2/4$

where $\Psi = \theta^2 + (1-\theta)^2$ and $\theta$ is the recombination fraction between the two loci.


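\[
\begin{aligned}
={}& \sum_{i=0}^{k-1} \sum_{l=1}^{r-1} \sum_{h=1}^{(m-2)-(r-1)}
P({}_1S_0 = {}_1S_1 = i)\,
\frac{(m-2)!}{(r-1-l)!\; l!\; h!\; ((m-2)-(r-1)-h)!} \\
& \times P({}_1S_j > k)^{r-1-l}\, P({}_1S_j = k)^{l}\, P({}_1S_j = i)^{h}\, P({}_1S_j < i)^{(m-2)-(r-1)-h} \\
& \times \sum_{t=1}^{N_2} \bigl(2P({}_2S_0 = t,\ {}_2S_1 < t) + P({}_2S_0 = t,\ {}_2S_1 = t)\bigr)\, P({}_2S_j < t)^{r-1+h},
\end{aligned}
\]

where $N_2 = \mathrm{Int}\!\left[\dfrac{N - n_1 m}{r + 1 + h}\right]$.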


The joint distribution of the IBD scores of two markers (or genes), P(IBD of marker (gene) 1, IBD of marker (gene) 2), is given in Table 3.1, adapted from Table 2.3, where $\Psi = \theta^2 + (1-\theta)^2$ and $\theta$ is the recombination fraction between the two loci.

From Table 3.1, we can deduce the joint distribution of $X_{ij}$ and $X_{i'j}$ for $i \ne i'$. This joint distribution is shown in Table 3.2, with $\Psi$ defined as in Table 3.1.
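The step from Table 3.1 to Table 3.2 can be checked numerically. The following is a minimal sketch, assuming that the statistic $X_{ij}$ equals 1 when the IBD score is 2 and 0 otherwise (consistent with the Bernoulli(0.25) null distribution stated in this section); the function names `ibd_joint` and `indicator_joint` are illustrative and not part of the original work.

```python
def ibd_joint(psi):
    """Joint distribution of the IBD scores at two loci (Table 3.1).

    Keys are (IBD at locus 1, IBD at locus 2); psi = theta^2 + (1 - theta)^2.
    """
    q = 1.0 - psi
    return {
        (2, 2): psi**2 / 4, (2, 1): psi * q / 2, (2, 0): q**2 / 4,
        (1, 2): psi * q / 2, (1, 1): (psi**2 + q**2) / 2, (1, 0): psi * q / 2,
        (0, 2): q**2 / 4, (0, 1): psi * q / 2, (0, 0): psi**2 / 4,
    }


def indicator_joint(psi):
    """Collapse Table 3.1 to the indicators X = 1 if IBD == 2 else 0 (Table 3.2)."""
    out = {(1, 1): 0.0, (1, 0): 0.0, (0, 1): 0.0, (0, 0): 0.0}
    for (a, b), p in ibd_joint(psi).items():
        out[(int(a == 2), int(b == 2))] += p
    return out


theta = 0.1                           # illustrative recombination fraction
psi = theta**2 + (1 - theta)**2
table_3_2 = indicator_joint(psi)      # compare with psi^2/4, (1-psi^2)/4, (psi^2+2)/4
```

Summing the cells of Table 3.1 in which neither score is 2 gives $(\Psi^2 + 2)/4$, which is the lower-right entry of Table 3.2.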

If a marker is not linked with the disease gene, then the probabilities that the IBD score of the marker equals 2, 1, and 0 are 0.25, 0.5, and 0.25, respectively. Thus, $X_{ij}$ has a Bernoulli distribution with parameter 0.25, and the distribution of ${}_lS_i$ under the null hypothesis is Binomial($n_l$, 0.25) for $l = 1, 2$. Let $\epsilon$ be the probability that the disease is caused by the gene, and $\theta$ be the distance between marker 0 and the gene. For equations 3.5-3.7, where the recessive gene is at the end, $X_{0j}$ is distributed as Bernoulli($\epsilon\Psi^2 + (1-\epsilon)\,0.25$) and ${}_lS_0$ as Binomial($n_l$, $\epsilon\Psi^2 + (1-\epsilon)\,0.25$). Similarly, for a dominant disease gene, $X_{0j}$ is distributed as Bernoulli($\epsilon\Psi/2 + (1-\epsilon)\,0.25$) and ${}_lS_0$ as Binomial($n_l$, $\epsilon\Psi/2 + (1-\epsilon)\,0.25$).

For Eq. 3.11-3.13, we need to find the joint distribution $P({}_lS_0 = i,\ {}_lS_1 = j)$, $i = 0, 1, \ldots, n_l$, $j = 0, 1, \ldots, n_l$, $l = 1, 2$. Let $n_{abl}$ be the number of $(X_{0j}, X_{1j}) = (a, b)$ in stage $l$. Then $(n_{00l}, n_{01l}, n_{10l}, n_{11l})$ has a multinomial distribution $(n_l;\ p_{00}, p_{01}, p_{10}, p_{11})$. Thus, the joint distribution

\begin{align}
& P({}_lS_0 = i,\ {}_lS_1 = j) \tag{3.27} \\
={}& P(n_{10l} + n_{11l} = i,\ n_{01l} + n_{11l} = j) \tag{3.28} \\
={}& \sum_{k=\max(0,\, i+j-n_l)}^{\min(i,\,j)} P(n_{11l} = k,\ n_{10l} = i-k,\ n_{01l} = j-k,\ n_{00l} = n_l - i - j + k) \tag{3.29}
\end{align}

can be computed once $p_{00}$, $p_{01}$, $p_{10}$, and $p_{11}$ are specified.
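The sum in Eq. (3.29) is straightforward to evaluate directly. The sketch below is one way to do so with the standard library only; the function names `multinomial_pmf` and `joint_pmf_s0_s1` and the argument order are illustrative assumptions, not the original program.

```python
from math import comb, factorial


def multinomial_pmf(counts, probs):
    """Multinomial probability of the given cell counts."""
    coef = factorial(sum(counts))
    for c in counts:
        coef //= factorial(c)          # stays an exact integer at every step
    p = float(coef)
    for c, q in zip(counts, probs):
        p *= q ** c
    return p


def joint_pmf_s0_s1(i, j, n, p00, p01, p10, p11):
    """P(S0 = i, S1 = j) for n sib pairs, following Eq. (3.29).

    k runs over the number of pairs with (X0, X1) = (1, 1).
    """
    total = 0.0
    for k in range(max(0, i + j - n), min(i, j) + 1):
        counts = (k, i - k, j - k, n - i - j + k)   # n11, n10, n01, n00
        total += multinomial_pmf(counts, (p11, p10, p01, p00))
    return total


# Illustrative check: two independent markers under the null hypothesis.
n = 6
null = dict(p00=9 / 16, p01=3 / 16, p10=3 / 16, p11=1 / 16)
total_mass = sum(joint_pmf_s0_s1(i, j, n, **null)
                 for i in range(n + 1) for j in range(n + 1))
```

With independent cells the joint probability factors into two Binomial($n$, 0.25) probabilities, which gives a quick sanity check on the summation limits.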

Although we assume the gene is in the middle of the two adjacent markers, the formulae are derived for the gene anywhere in between. Let the recombination fraction between the gene and marker 0 be $\theta_0$ and between the gene and marker 1 be $\theta_1$. Let $\theta_2$ be the recombination fraction between the two markers. The joint distribution of $X_{0j}$ and $X_{1j}$, for $x = 0, 1$, $y = 0, 1$, is


Table 3.2. The joint distribution of $X_{ij}$ and $X_{i'j}$.

                        $X_{i'j}$
$X_{ij}$           1                     0

   1         $\Psi^2/4$           $(1-\Psi^2)/4$
   0         $(1-\Psi^2)/4$       $(\Psi^2+2)/4$
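\begin{align}
& P(X_{0j} = x,\ X_{1j} = y) \tag{3.30} \\
={}& P(X_{0j} = x,\ X_{1j} = y \mid \text{gene IBD}=2)\,P(\text{gene IBD}=2) \notag \\
& + P(X_{0j} = x,\ X_{1j} = y \mid \text{gene IBD}=1)\,P(\text{gene IBD}=1) \notag \\
& + P(X_{0j} = x,\ X_{1j} = y \mid \text{gene elsewhere})\,P(\text{gene elsewhere}) \tag{3.31} \\
={}& P(X_{0j} = x \mid X_{1j} = y,\ \text{gene IBD}=2)\,P(X_{1j} = y \mid \text{gene IBD}=2)\,P(\text{gene IBD}=2) \notag \\
& + P(X_{0j} = x \mid X_{1j} = y,\ \text{gene IBD}=1)\,P(X_{1j} = y \mid \text{gene IBD}=1)\,P(\text{gene IBD}=1) \notag \\
& + P(X_{0j} = x \mid X_{1j} = y,\ \text{gene elsewhere})\,P(X_{1j} = y \mid \text{gene elsewhere})\,P(\text{gene elsewhere}). \tag{3.32}
\end{align}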

Since an individual has to have two recessive disease genes in order to be affected, Eq. (3.32) becomes

\begin{align}
={}& P(X_{0j} = x \mid X_{1j} = y,\ \text{gene IBD}=2)\,P(X_{1j} = y \mid \text{gene IBD}=2)\,\epsilon \notag \\
& + P(X_{0j} = x \mid X_{1j} = y,\ \text{gene elsewhere})\,P(X_{1j} = y \mid \text{gene elsewhere})\,(1-\epsilon). \notag
\end{align}

For a dominant disease, if an ASP is caused by a gene, then both sibs must share at least the disease gene, and there is a 50-50 chance that they share the other allele. Thus, Eq. (3.32) becomes

\begin{align}
={}& P(X_{0j} = x \mid X_{1j} = y,\ \text{gene IBD}=2)\,P(X_{1j} = y \mid \text{gene IBD}=2)\,(\epsilon/2) \notag \\
& + P(X_{0j} = x \mid X_{1j} = y,\ \text{gene IBD}=1)\,P(X_{1j} = y \mid \text{gene IBD}=1)\,(\epsilon/2) \notag \\
& + P(X_{0j} = x \mid X_{1j} = y,\ \text{gene elsewhere})\,P(X_{1j} = y \mid \text{gene elsewhere})\,(1-\epsilon). \tag{3.33}
\end{align}

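Let $\Psi_l = \theta_l^2 + (1-\theta_l)^2$, $l = 0, 1, 2$, and $p_{xy} = P(X_{0j} = x,\ X_{1j} = y)$, $x = 0, 1$, $y = 0, 1$. Then, applying Table 3.2, for a recessive disease we have

\begin{align}
p_{00} &= (1-\Psi_0^2)(1-\Psi_1^2)\,\epsilon + \frac{\Psi_2^2+2}{4}\,(1-\epsilon), \tag{3.34} \\
p_{10} &= \Psi_0^2(1-\Psi_1^2)\,\epsilon + \frac{1-\Psi_2^2}{4}\,(1-\epsilon), \tag{3.35} \\
p_{01} &= (1-\Psi_0^2)\Psi_1^2\,\epsilon + \frac{1-\Psi_2^2}{4}\,(1-\epsilon), \tag{3.36} \\
p_{11} &= \Psi_0^2\Psi_1^2\,\epsilon + \frac{\Psi_2^2}{4}\,(1-\epsilon). \tag{3.37}
\end{align}

For a dominant disease,

\begin{align}
p_{00} &= (1-\Psi_0^2)(1-\Psi_1^2)\,\frac{\epsilon}{2} + (1-\Psi_0+\Psi_0^2)(1-\Psi_1+\Psi_1^2)\,\frac{\epsilon}{2} + \frac{\Psi_2^2+2}{4}\,(1-\epsilon), \tag{3.38} \\
p_{10} &= \Psi_0^2(1-\Psi_1^2)\,\frac{\epsilon}{2} + \Psi_0(1-\Psi_0)(1-\Psi_1+\Psi_1^2)\,\frac{\epsilon}{2} + \frac{1-\Psi_2^2}{4}\,(1-\epsilon), \tag{3.39} \\
p_{01} &= (1-\Psi_0^2)\Psi_1^2\,\frac{\epsilon}{2} + (1-\Psi_0+\Psi_0^2)\Psi_1(1-\Psi_1)\,\frac{\epsilon}{2} + \frac{1-\Psi_2^2}{4}\,(1-\epsilon), \tag{3.40} \\
p_{11} &= \Psi_0^2\Psi_1^2\,\frac{\epsilon}{2} + \Psi_0(1-\Psi_0)\Psi_1(1-\Psi_1)\,\frac{\epsilon}{2} + \frac{\Psi_2^2}{4}\,(1-\epsilon). \tag{3.41}
\end{align}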

Thus, P(the marker closest to the gene is found) can be calculated. In addition to the analytic computation for Eq. 3.3 and 3.4, simulations were also done.
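The cell probabilities in Eq. (3.34)-(3.41) can be computed with a few lines of code. This is a minimal sketch, assuming that the two flanking markers are conditionally independent given the IBD state at the gene (as the factored form of the equations implies); the names `psi`, `cell_probs`, and the `mode` argument are illustrative.

```python
def psi(theta):
    """Psi = theta^2 + (1 - theta)^2 for a recombination fraction theta."""
    return theta**2 + (1 - theta)**2


def cell_probs(theta0, theta1, theta2, eps, mode="recessive"):
    """Return {(x, y): P(X0 = x, X1 = y)} following Eq. (3.34)-(3.41)."""
    a, b, c = psi(theta0), psi(theta1), psi(theta2)

    # Sib pairs whose disease is not caused by the gene: Table 3.2 with Psi_2.
    null = {(1, 1): c * c / 4, (1, 0): (1 - c * c) / 4,
            (0, 1): (1 - c * c) / 4, (0, 0): (c * c + 2) / 4}

    def given(p0, p1):
        # p0, p1: P(marker IBD = 2 | gene IBD state) for markers 0 and 1.
        return {(1, 1): p0 * p1, (1, 0): p0 * (1 - p1),
                (0, 1): (1 - p0) * p1, (0, 0): (1 - p0) * (1 - p1)}

    ibd2 = given(a * a, b * b)                  # gene IBD = 2
    ibd1 = given(a * (1 - a), b * (1 - b))      # gene IBD = 1

    if mode == "recessive":
        w2, w1 = eps, 0.0
    elif mode == "dominant":
        w2, w1 = eps / 2, eps / 2
    else:
        raise ValueError("mode must be 'recessive' or 'dominant'")

    return {xy: w2 * ibd2[xy] + w1 * ibd1[xy] + (1 - eps) * null[xy]
            for xy in null}


# Illustrative values: gene midway between markers, no interference assumed.
t0 = t1 = 0.05
t2 = t0 + t1 - 2 * t0 * t1
rec = cell_probs(t0, t1, t2, eps=0.5, mode="recessive")
dom = cell_probs(t0, t1, t2, eps=0.5, mode="dominant")
```

In both modes the four cell probabilities sum to 1, and setting $\epsilon = 0$ recovers Table 3.2.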

3.3.2     Simulation  under  More  Realistic  Assumptions 

The assumption of independent $X_{ij}$ conflicts with the fact that some loci are linked. Simulations were therefore done under a Markov chain model with combinations of resource $N$ = 1000, 2000, 5000, and 10,000; number of ASPs $n$ = 10, 25, 75, and 100; $\epsilon$ = 0.25, 0.5, 0.75, and 1; $m$ from 50 to 350 with an increment of 25 for the two-stage design and 10 for the one-stage design; $r$ from 5 to $m$ with an increment of $(m-5)/10$; and $n_1$ from 5 to $n-5$ with an increment of 5.

The  simulations  were  conducted  as  follows: 
1.  Read in the parameters: resource $N$, heterogeneity $\epsilon$, number of ASPs $n$, and $m$, $n_1$, and $r$ of the design.

2.  Given $n_1$ and $m$, generate $y$ from uniform(0,1). If $y < \epsilon$, then generate the gene location from the uniform distribution on (0, 3300). Let locus $l$ and locus $l+1$ be the two adjacent loci whose map positions bracket the gene location, with positions expressed as fractions of $L$, the total length of the genome. This step determines which interval contains the gene, and then:

(a)  For a recessive disease, generate IBD scores at loci $l$ and $l+1$ conditional on both sibs of the ASP carrying two disease genes and on the gene locus being in the middle of the two markers. The Haldane map function is used to convert map distance into the recombination fraction $\theta$.

(b)  For a dominant disease, if $y < 0.5\epsilon$, generate IBD scores at loci $l$ and $l+1$ conditional on both sibs of the ASP sharing two alleles at the gene locus and on the gene locus being in the middle of the two markers. If $0.5\epsilon < y < \epsilon$, generate IBD scores at loci $l$ and $l+1$ conditional on both sibs of the ASP sharing one allele at the gene locus and, again, on the gene locus being in the middle of the two markers.

If $y > \epsilon$, let $l = 0$, and generate the IBD score at locus 0 from Bernoulli(0.25), as if there is no gene linked with marker 0.

3.  For the Markov chain model simulation, generate the IBD score at locus $i$, with $i$ starting from $l-1$ and decreasing to 0, conditional on the IBD score at locus $i+1$, and then generate the IBD score at locus $i$ conditional on the IBD score at locus $i-1$ for $i = l+1, \ldots, m-1$. For the independent model simulation, generate the IBD scores independently from Trinomial($n_1$, 0.25, 0.5, 0.25). The conditional probability formulas are given by Table 2.3.

4.  Convert the IBD scores into the statistics $X_{ij}$.

5.  Repeat steps 2, 3, and 4 $n_1$ times and then calculate ${}_1S_i$.

6.  Check for ties, adjust $R$ to include all the ties with the $r$th highest ${}_1S_i$ for Stage II, and check whether ${}_1S_0$ passes Stage I. If it does, calculate $N_2$ according to the resource constraint and go to the next step. If not, record the detection as a failure and go back to step 2.

7.  The $X_{ij}$ in Stage II are generated in the same way as those in Stage I, except that only the chosen markers are used and the process is repeated $N_2$ times.

8.  Check whether ${}_2S_0$ or ${}_2S_1$ is the unique largest among the ${}_2S_i$ in Stage II.

9.  If marker 0 or 1 is chosen, it is a success; otherwise it is a failure. Go back to step 2 until enough simulations have been done.
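The Markov chain part of step 3 can be sketched as follows. This is an illustrative reconstruction, not the original program: it assumes equally spaced markers, uses the Haldane map function $\theta = (1 - e^{-2d})/2$ with $d$ in Morgans, and takes the transition probabilities to be the rows of Table 3.1 divided by the marginal probabilities 0.25, 0.5, 0.25. The function names are hypothetical, and only one direction of the chain (moving away from a starting locus) is shown.

```python
import math
import random


def haldane(distance_cM):
    """Haldane map function: map distance in cM -> recombination fraction."""
    return 0.5 * (1.0 - math.exp(-2.0 * distance_cM / 100.0))


def transition_matrix(theta):
    """P(IBD at next locus | IBD at current locus), derived from Table 3.1."""
    psi = theta**2 + (1 - theta)**2
    q = 1 - psi
    return {
        2: {2: psi * psi, 1: 2 * psi * q, 0: q * q},
        1: {2: psi * q, 1: psi * psi + q * q, 0: psi * q},
        0: {2: q * q, 1: 2 * psi * q, 0: psi * psi},
    }


def simulate_ibd_chain(m, spacing_cM, rng, start=None):
    """IBD scores at m equally spaced loci for one sib pair.

    If `start` is None, the first locus is drawn from the null
    distribution (0.25, 0.5, 0.25) on the scores (2, 1, 0).
    """
    T = transition_matrix(haldane(spacing_cM))
    state = start if start is not None else rng.choices([2, 1, 0], [0.25, 0.5, 0.25])[0]
    chain = [state]
    for _ in range(m - 1):
        row = T[state]
        state = rng.choices([2, 1, 0], [row[2], row[1], row[0]])[0]
        chain.append(state)
    return chain


rng = random.Random(12345)
scores = simulate_ibd_chain(m=50, spacing_cM=10.0, rng=rng)
x = [int(s == 2) for s in scores]      # step 4: convert IBD scores to X statistics
```

To follow step 3 fully, the same routine would be run twice from the loci flanking the gene, once toward locus 0 and once toward locus $m-1$, with `start` set to the IBD score already generated in step 2.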

The  programs  were  compiled  using  GCC  version  2.7.2  on  a  PC  with  Pentium  166 
CPU  and  48M  RAM  in  an  OS/2  environment.  The  random  number  generator  for  the 
simulation  was  adapted  from  Press  (1992). 

3.3.3     Results 

Based on analytic computation and simulation, the designs with the highest probability of finding the right marker (power) were identified. These optimal designs are given in Table 3.3 for searching for a recessive gene and in Table 3.4 for a dominant gene. In both tables, the ASP column reports the number of available affected sib pairs; the $\epsilon$ column shows the probability that the disease is caused by the gene, representing heterogeneity; the $m$ column gives the number of marker loci used in the first stage; the $r$ column, the proposed number of loci to be chosen in Stage I for the Stage II study; the $n_1$ column, the number of ASPs used in the first stage; and the F2 column, the probability of locating the right marker with the best two-stage design, obtained by the analytic formula. The Indep and Markov columns show the probabilities obtained by simulation under the independence assumption and the Markov chain assumption, respectively, without combining first- and second-stage data, and the Comb. column shows the simulated probability of the two-stage design under the Markov chain model with first- and second-stage data combined. The F1 column gives the probability for the best one-stage design, i.e., with optimal $m$ and $n$ subject to $mn \le N$, obtained by the analytic formula. The last column, F2−F1, shows the increase in probability of the two-stage design over the one-stage design. An asterisk marks cases where the increase was over 0.35 for Table 3.3 (recessive) and over 0.15 for Table 3.4 (dominant).
3.4     Type I Error and Power of Claiming Linkage

Tables 3.3 and 3.4 give the optimal designs and the probabilities of finding the locus linked to the responsible gene when the gene exists, but they do not provide information on the probability of reaching a false conclusion when there is no gene responsible for the disease. The usual requirement is that the LOD score in Stage II should be greater than a certain threshold, $t$, in order to claim linkage. Given $R = r$, we let the threshold $t$ be the $100(1-\alpha)$ percentile of the unique maximum, $T$, of $r$ Binomial($n_2$, 0.25) random variables. The probability mass function of $T$ is

\[
\begin{aligned}
P(T = s) &= P({}_2S_{I_0} = s \mid {}_2S_{I_0} \text{ is the unique maximum}) \\
&= \frac{P({}_2S_{I_0} = s,\ {}_2S_{I_0} \text{ is the unique maximum})}{P({}_2S_{I_0} \text{ is the unique maximum})} \\
&= \frac{r\, f(s)\, F(s-1)^{r-1}}{\sum_{u=0}^{n_2} r\, f(u)\, F(u-1)^{r-1}}.
\end{aligned}
\]

The probability mass function for the marker group is

\[
\begin{aligned}
P(T = s) &= P({}_2S_{I_0} = {}_2S_{I_0+1} = s \mid {}_2S_{I_0} \text{ and } {}_2S_{I_0+1} \text{ are the unique maximum}) \\
&= \frac{P({}_2S_{I_0} = {}_2S_{I_0+1} = s,\ {}_2S_{I_0} \text{ and } {}_2S_{I_0+1} \text{ are the unique maximum})}{P({}_2S_{I_0} \text{ and } {}_2S_{I_0+1} \text{ are the unique maximum})} \\
&= \frac{(r-1)\, f(s)^2\, F(s-1)^{r-2}}{\sum_{u=0}^{n_2} (r-1)\, f(u)^2\, F(u-1)^{r-2}},
\end{aligned}
\]

where $f(s)$ and $F(s)$ are the probability mass function and cumulative distribution function of Binomial($n_2$, 0.25). Tables 3.5 and 3.6 give the value of $t$ for the unique maximum and for the marker group, respectively. To use Tables 3.5 and 3.6, first find your $n_2$ in the $n_2$ column, then find your $R$ in the $R$ column; if your $R$ is not in the table, use the largest tabulated $R$ that is smaller than your $R$ with the same $n_2$. The value in the $t$ column is the 95th percentile. The range of $n_2$ is from 5 to 95, and the range of $R$ is from 5 to 300.
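The probability mass function of the unique maximum can be evaluated directly from the binomial pmf and cdf. The sketch below is illustrative only: the function names are hypothetical, and the percentile convention used here (the smallest $s$ whose cumulative probability reaches the level) is an assumption that may differ from the convention used to construct Tables 3.5 and 3.6.

```python
from math import comb


def binom_pmf(n, p):
    """List of Binomial(n, p) probabilities for s = 0..n."""
    return [comb(n, s) * p**s * (1 - p)**(n - s) for s in range(n + 1)]


def unique_max_pmf(r, n2, p=0.25):
    """P(T = s): the maximum of r iid Binomial(n2, p) variables equals s,
    conditional on the maximum being attained by exactly one variable."""
    f = binom_pmf(n2, p)
    weights = []
    cdf_below = 0.0                      # F(s - 1), with F(-1) = 0
    for s in range(n2 + 1):
        weights.append(r * f[s] * cdf_below ** (r - 1))
        cdf_below += f[s]
    total = sum(weights)
    return [w / total for w in weights]


def percentile(pmf, level=0.95):
    """Smallest s with cumulative probability >= level (assumed convention)."""
    acc = 0.0
    for s, p in enumerate(pmf):
        acc += p
        if acc >= level:
            return s
    return len(pmf) - 1


pmf = unique_max_pmf(r=5, n2=5)
t = percentile(pmf, 0.95)
```

The marker-group version follows the same pattern with the weight $(r-1)\,f(s)^2\,F(s-1)^{r-2}$ in place of $r\,f(s)\,F(s-1)^{r-1}$.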

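A linkage can be claimed if and only if ${}_2S_{I_0} > t$, where $I_0$ is one of the markers that was chosen in Stage II. Clearly,

\[
\begin{aligned}
P(\text{linkage claim is incorrect}) \le{}& P({}_2S_{I_0} > t \mid \text{no gene is responsible})\,P(\text{no gene is responsible}) \\
& + P(I_0 \text{ is wrong and } {}_2S_{I_0} > t \mid \text{a gene is responsible})\,P(\text{a gene is responsible}).
\end{aligned}
\]

The (prior) P(no gene is responsible) is usually unknown, but we can conclude that $P(I_0 \text{ is wrong and } {}_2S_{I_0} > t \mid \text{a gene is responsible}) \le P({}_2S_{I_0} > t \mid \text{no gene is responsible})$. A proof is as follows. Let $A$ denote the event {the gene is between marker $i_0$ and marker $i_0 + 1$}.

\begin{align}
& P(I_0 \text{ is wrong and } {}_2S_{I_0} > t \mid \text{a gene is responsible}) \tag{3.42} \\
={}& P({}_2S_{I_0} > {}_2S_j\ \forall j \ne I_0,\ {}_2S_{I_0} > t \text{ and } I_0 \ne i_0 \text{ or } i_0+1 \mid A) \tag{3.43} \\
={}& \sum_{k=t+1}^{n_2} P(k = {}_2S_{I_0} > {}_2S_j\ \forall j \ne I_0, \text{ and } I_0 \ne i_0 \text{ or } i_0+1 \mid A) \tag{3.44} \\
={}& \sum_{k=t+1}^{n_2} P(k > {}_2S_j\ \forall j \ne I_0,\ i_0, \text{ or } i_0+1,\ k > {}_2S_{i_0},\ k > {}_2S_{i_0+1}, \notag \\
& \qquad\quad {}_2S_{I_0} = k \text{ and } I_0 \ne i_0 \text{ or } i_0+1 \mid A) \tag{3.45} \\
={}& \sum_{k=t+1}^{n_2} \bigl\{ P(k > {}_2S_j\ \forall j \ne I_0,\ i_0, \text{ or } i_0+1,\ I_0 \ne i_0 \text{ or } i_0+1 \mid A) \tag{3.46} \\
& \qquad\quad \times P(k > {}_2S_{i_0},\ k > {}_2S_{i_0+1} \mid A) \tag{3.47} \\
& \qquad\quad \times P({}_2S_{I_0} = k,\ I_0 \ne i_0 \text{ or } i_0+1 \mid A) \bigr\}, \tag{3.48}
\end{align}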
Table 3.5. The 95% percentile of the unique maximum of R Binomial(n2, 0.25).

 n2    R    t  |  n2    R    t  |  n2    R    t  |  n2    R    t  |  n2    R    t  |  n2    R    t

  5    5    5  |  28   44   16  |  45   20   21  |  59  145   28  |  72  255   33  |  84  266   37
  5   24    5* |  28  149   17  |  45   46   22  |  60    5   23  |  73    5   27  |  85    5   30
  6    5    5  |  29    5   14  |  45  120   23  |  60    6   24  |  73    7   28  |  85    6   31
  6    7    1  |  29   11   15  |  46    5   19  |  60   12   25  |  73   12   29  |  85    9   32
  6   84    6* |  29   29   16  |  46    7   20  |  60   22   26  |  73   22   30  |  85   15   33
  7    5    6  |  29   92   17  |  46   15   21  |  60   48   27  |  73   42   31  |  85   26   34
  7   22    7  |  30    5   14  |  46   34   22  |  60  108   28  |  73   88   32  |  85   50   35
  8    5    6  |  30    8   15  |  46   85   23  |  60  264   29  |  73  193   33  |  85   99   36
  8    9    7  |  30   20   16  |  46  233   24  |  61    5   24  |  74    5   27  |  85  207   37
  8   69    8  |  30   60   17  |  47    5   19  |  61   10   25  |  74    6   28  |  86    5   31
  9    5    7  |  30  202   18  |  47    6   20  |  61   18   26  |  74   10   29  |  86    8   32
  9   24    8  |  31    5   14  |  47   12   21  |  61   37   27  |  74   18   30  |  86   13   33
  9  225    9  |  31    1   15  |  47   26   22  |  61   82   28  |  74   34   31  |  86   22    …
 10    …    7  |  31   15    …  |   …    …    …  |   …    …    …  |  74    …    …  |   …   79   36
 10   70    9  |   …    …   18  |  48    5   20  |  62    8   25  |  75    5   28  |  86  162   37
 11    5    7  |  32    5   15  |  48   10   21  |  62   15   26  |  75    9   29  |  87    5   31
 11    6    8  |  32   11   16  |  48   20   22  |  62   29   27  |  75   15   30  |  87    7   32
 11   29    9  |  32   28   17  |  48   46   23  |  62   63   28  |  75   28   31  |  87   11   33
 11  216   10  |  32   81   18  |  48  115   24  |  62  144   29  |  75   55   32  |  87   18   34
 12    …    8  |  32  273   19  |  49    …   20  |   …    …    …  |  75    …    …  |   …    …    …
Table 3.6. The 95% percentile of the unique maximum marker group of R Binomial(n2, 0.25).
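\begin{align}
& P({}_2S_{I_0} > t \mid \text{no gene is responsible}) \tag{3.49} \\
={}& \sum_{k=t+1}^{n_2} \bigl\{ P(k > {}_2S_j,\ j \ne I_0,\ i_0 \text{ or } i_0+1 \text{ and } I_0 \ne i_0 \text{ or } i_0+1 \mid \text{no gene}) \tag{3.50} \\
& \qquad\quad \times P(k > {}_2S_{i_0},\ k > {}_2S_{i_0+1} \mid \text{no gene}) \tag{3.51} \\
& \qquad\quad \times P({}_2S_{I_0} = k \text{ and } I_0 \ne i_0 \text{ or } i_0+1 \mid \text{no gene}) \bigr\}. \tag{3.52}
\end{align}

Eq. (3.47) is less than Eq. (3.51) and the other corresponding terms are equal; therefore

\[
P(I_0 \text{ is wrong and } {}_2S_{I_0} > t \mid \text{a gene is responsible}) \le P({}_2S_{I_0} > t \mid \text{no gene is responsible})
\]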

and  thus, 

P(linkage  claim  is  incorrect) 
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all  possible  results      ^        1=0 
of  the  stage  I 


i=0 


P(the  result  of  stage  I) 


3.5     Discussion 

The Monte Carlo results indicated that the relative errors between the probabilities calculated from the formulas and those from the Markov chain simulations were under 7% in the dominant cases and under 15% in the recessive cases. Consequently, the approximation using the independence assumption for dependent marker loci was acceptable.

The simulation studies also showed that, in the dominant cases, combining Stage I data with Stage II data did not have any significant advantage: the probability of locating the correct marker increased by less than 3%. For the recessive cases, there was some advantage when the ASPs were few, and the probability of locating the correct marker could increase by as much as 10%. However, it is very difficult to combine the data in the theoretical derivation.

The power to find the gene depends on the exact gene location between markers. Since Tables 3.3 and 3.4 were constructed under the least favorable configuration, the actual power should be higher.

The two-stage approach indeed boosted the probability of finding the correct gene location under resource constraints. In many cases this probability increased by 20% to 30%. In searching for recessive disease genes, there were several instances when the improvement exceeded 35% (Table 3.3). However, if there are enough resources that almost all available markers can be typed on all ASPs, then the one-stage approach may have higher power (see, e.g., N = 5000 and ASP = 10, or N = 10,000 and ASP = 10, 25 in Table 3.3). Since the power loss in these cases is so small, we may nevertheless always choose the optimal two-stage design.

As shown in Tables 3.3 and 3.4, it requires many more resources to locate a dominant disease gene than a recessive one. For example, in a two-stage design with 25 ASPs and N = 1000, we can locate a recessive disease gene with probability 0.88 when $\epsilon = 1$. But for a dominant disease gene, more than 100 ASPs and/or more than N = 10,000 are needed to achieve the same probability.

Another point worth noting is that phenocopies can severely reduce the probability of finding the correct gene location. For example, to locate a recessive gene with resource N = 5000, 25 ASPs, and $\epsilon = 1$, the chance of finding it is 99.9%. However, when $\epsilon = 0.5$, even if we double the resource to N = 10,000 and 50 ASPs, the chance of finding the correct locus is only 90%. Although both designs have about 25 ASPs whose disease is caused by the gene, the extra phenocopy ASPs reduce the probability considerably.


CHAPTER  4 
TWO-STAGE  GENOME  SEARCH  FOR  COMPLEX  DISEASE 


In the previous chapter, we focused on finding a single disease gene. However, genetic diseases are not always single-gene diseases. Many of them are complex diseases, i.e., diseases caused by several genes. For example, insulin-dependent diabetes mellitus (IDDM) is influenced by a number of susceptibility genes and environmental factors (Luo et al., 1995). A disease phenotype controlled by genes at several different loci is considered to have "nonallelic heterogeneity" and "when this disease is relatively rare and mutation rates are low, individuals within a family are generally homogeneous. Locus heterogeneity then leads to the situation that the recombination fraction between disease phenotype and marker will be different in different families" (Ott, 1991, p. 199). This chapter discusses the probability of finding two unlinked recessive genes that may cause the same disease in a two-stage search under the following genetic model.

4.1     Genetic  Model  and  Assumptions 

Throughout this chapter, we assume the following:

•  No epistasis, i.e., no interaction between genes.

•  When an individual carries the two disease genes, the penetrance is additive, i.e., the probability of being affected for an individual who has two recessive genes at two loci is twice as high as for an individual who has two recessive genes at only one locus.

•  For an individual who has no disease gene, the probability of being affected is $k_0$ times the probability of being affected for an individual who has two recessive genes at one locus.

•  Receiving one gene has no effect on receiving other genes.

Other assumptions, necessary for simplifying the analytic study and similar to those in the previous chapter, are:

•  There are two alleles at each gene locus, denoted d (disease gene) and D (normal gene). The population frequency of the disease gene is p for both loci.

•  Genes are in the middle of two adjacent markers.

•  In this study, only three maps, 5 cM (centimorgan), 10 cM, and 20 cM, are available, and every marker is highly polymorphic. Marker positions on each chromosome are at 0 cM, 5 cM, 10 cM, ..., for the 5 cM map; 0 cM, 10 cM, ..., for the 10 cM map; and 0 cM, 20 cM, 40 cM, ..., for the 20 cM map.

•  The cost of typing alleles is a constant.

4.2     Two-stage  Genome  Search 

For  simplicity,  we  use  a  slightly  different  two-stage  approach  for  searching  complex 
disease  genes.  In  the  first  stage,  we  choose  a  threshold  for  statistics  instead  of  choosing 
a  number  of  loci.  In  the  second  stage,  we  choose  the  loci  on  different  chromosomes 
where  the  statistics  have  the  highest  overall  value.  Details  are  as  follows: 

Following Chapter 3, suppose there are $n$ ASPs and there are enough resources to type $N$ markers for ASPs. Again, there are three numbers that must be determined in a two-stage design: $n_1$ and $m$, the number of ASPs and the number of markers to be used in Stage I, and $k$, the threshold for markers to be studied in Stage II. Define ${}_1S_i$ in the same way as in Chapter 3, where $i$ is the index of loci, $j$ is the index of ASPs, and $X_{ij}$ is defined by assumption 6 in §3.1. If, in Stage I, marker $i$ has a ${}_1S_i$ value higher than the threshold $k$, we will study this marker again in Stage II. Let the number of markers that pass Stage I be $R$.

In Stage II, $R$ markers are typed on $N_2$ ASPs, where $N_2$ is the largest number allowed by the resource constraint and $R$. Since $R$ is a random variable, $N_2$ is also a random variable. Thus, $N_2$ is the largest $x$ such that $m n_1 + R x \le N$. We define

\[
{}_2S_i = \sum_{j=n_1+1}^{n_1+N_2} X_{ij}
\]

for Stage II. Then, markers that meet the following criteria will be declared as having a gene nearby:

•  A  marker  with  the  uniquely  highest  2S{  is  claimed  to  be  the  marker  nearest  to 
the  disease  gene. 

•  A marker group (see page 42) with the uniquely highest score is claimed to have the gene lying between its markers.

•  If two markers or marker groups have the same highest score but are on different chromosomes, then each is declared to have a gene nearby, in the corresponding way described above.

If  none  of  the  above  applies,  then  gene  location  is  considered  undetermined. 
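The bookkeeping for the threshold-based design can be sketched in a few lines. This is a simplified illustration under stated assumptions: the function names are hypothetical, scores are held in plain Python containers, and only the single-marker rule (a uniquely highest score) is implemented; marker groups and ties across chromosomes are left out.

```python
def passing_markers(stage1_scores, k):
    """Indices of markers whose Stage I score exceeds the threshold k."""
    return [i for i, s in enumerate(stage1_scores) if s > k]


def stage2_pairs(N, m, n1, R):
    """Largest x with m*n1 + R*x <= N (0 if no marker passed or the budget is spent)."""
    if R <= 0:
        return 0
    return max(0, (N - m * n1) // R)


def unique_highest(stage2_scores):
    """Marker with the uniquely highest Stage II score, or None if there is a tie.

    stage2_scores maps marker index -> Stage II score.
    """
    if not stage2_scores:
        return None
    top = max(stage2_scores.values())
    winners = [i for i, s in stage2_scores.items() if s == top]
    return winners[0] if len(winners) == 1 else None


# Illustrative numbers only.
chosen = passing_markers([3, 7, 5, 9], k=5)          # markers 1 and 3 pass
n2 = stage2_pairs(N=1000, m=50, n1=10, R=20)         # (1000 - 500) // 20
best = unique_highest({1: 4, 3: 6})
```

Because $R$ is only known after Stage I, `stage2_pairs` has to be evaluated per replicate in a simulation rather than fixed in advance.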

Once the location(s) have been chosen, the next step, as in Chapter 3, is to check whether we can claim linkage. Let $t$ be the $100(1-\alpha)$ percentile of the maximum of $r$ Binomial($n_2$, 0.25) random variables. If ${}_2S_i$ of the chosen location(s) is greater than $t$, then we claim there is linkage at that location(s). Since we imposed some restrictions on declaring a marker as having a gene nearby, the actual type I error will be less than $\alpha$. The 95% percentiles for the unique maximum and the marker group were given in Tables 3.5 and 3.6, respectively.
4.3     Probability of Allocating the Correct Marker for a Complex Disease

In this section we discuss an analytical approach and some simulation results. §4.3.1 is a preparation for the discussion.

4.3.1     Possible Parental Genotypes and Trait IBD Distribution

For a disease gene locus, say locus $i$, let $d$ and $D$ denote the disease allele and
the normal allele, respectively. Let the population frequency of $d$ be $p$, and let PG
denote the parental genotype. Let

\[
E_i = (e_{i1}, e_{i2}) =
\begin{cases}
(1,1), & \text{if both sibs received two recessive genes at locus } i,\\
(1,0), & \text{if sib 1 received two recessive genes and sib 2 did not at locus } i,\\
(0,1), & \text{if sib 2 received two recessive genes and sib 1 did not at locus } i,\\
(0,0), & \text{if neither sib received two recessive genes at locus } i.
\end{cases}
\]

Then the distribution of the trait IBD, $I_i$, conditional on the parental genotype and
$E_i$, is given in Table 4.1. The numbers in the table are easy to verify. For example,
for parental genotype 2, if the order of the chromosomes is specified, say 1, 2, 3,
and 4, where 1 and 2 belong to the father and 3 and 4 belong to the mother, then the
probability that the parents are $Dd$ and $dd$ with $D$ on chromosome 4 is $p^3(1-p)$.
Permuting the position of $D$ among the chromosomes gives 4 different arrangements;
therefore, the probability of parents having genotypes $Dd$ and $dd$ is $4p^3(1-p)$ in a
random mating population. When the parental genotype is $Dd$ and $dd$, the possible
sib genotypes are $Dd$ and $dd$ with equal frequency; therefore,
$P(E_i=(1,1) \mid PG=2) = P(E_i=(1,0) \mid PG=2) = P(E_i=(0,1) \mid PG=2) = P(E_i=(0,0) \mid PG=2) = \frac{1}{4}$.
To illustrate the conditional distribution of the trait IBD score, label the parental
alleles with subscripts, $d_1d_2$ and $Dd_3$. When the parental genotype is $dd$ and $Dd$,
and $E_i = (0,0)$, each sib's genotype can only be $Dd_1$ or $Dd_2$ with equal chance;
therefore, $P(I_i=2 \mid PG=2, E_i=(0,0)) = P(I_i=1 \mid PG=2, E_i=(0,0)) = \frac{1}{2}$.
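
The two probabilities quoted for parental genotype 2 can be checked by brute-force enumeration. The following is a minimal sketch, not the author's program: the chromosome labels and function names are illustrative, and it assumes each of the 16 ways of transmitting one chromosome per parent to each of the two sibs is equally likely.

```python
from collections import Counter
from fractions import Fraction
from itertools import product

# Labelled parental chromosomes (labels are illustrative):
# one parent is d1 d2 (genotype dd), the other is D d3 (genotype Dd).
PARENT_DD = ("d1", "d2")
PARENT_Dd = ("D", "d3")


def is_dd(child):
    """True if the child carries two recessive alleles."""
    return all(allele.startswith("d") for allele in child)


def joint_e_and_ibd():
    """Joint distribution of (E, trait IBD) over the 16 equally likely sib pairs."""
    children = list(product(PARENT_DD, PARENT_Dd))  # 4 possible children
    weight = Fraction(1, len(children) ** 2)
    joint = Counter()
    for sib1, sib2 in product(children, repeat=2):
        e = (int(is_dd(sib1)), int(is_dd(sib2)))
        # IBD score: number of parents from whom both sibs got the same chromosome
        ibd = int(sib1[0] == sib2[0]) + int(sib1[1] == sib2[1])
        joint[(e, ibd)] += weight
    return joint


def prob_e(joint, e):
    """Marginal probability of the indicator pair E = e."""
    return sum(p for (e_, _), p in joint.items() if e_ == e)


def prob_ibd_given_e(joint, ibd, e):
    """Conditional probability P(IBD = ibd | E = e)."""
    return joint.get((e, ibd), Fraction(0)) / prob_e(joint, e)


JOINT = joint_e_and_ibd()
```

Calling `prob_e(JOINT, (0, 0))` and `prob_ibd_given_e(JOINT, 2, (0, 0))` reproduces the values 1/4 and 1/2 stated in the text for $E_i = (0,0)$.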

4.3.2     Analytic  Approach 

Assume there are two unlinked genes, namely $G_1$ and $G_2$. Without loss of generality,
let $X_{1j}$ and $X_{2j}$ be the statistics of the markers next to $G_1$, and $X_{3j}$ and
$X_{4j}$ be the statistics of the markers next to $G_2$ for ASP $j$; let ${}_lS_1$, ${}_lS_2$,
${}_lS_3$, and ${}_lS_4$ be the corresponding sums of the $X$s in Stage $l$, as defined at the
beginning of this section. Also, let $I_1$ and $I_2$ be the IBD scores of $G_1$ and $G_2$ for
the affected sib pair. Let the penetrance of carrying two recessive genes at one locus
be $\lambda$, at two loci $2\lambda$, and at no locus $k_0\lambda$, where $k_0$ is the relative
risk of being affected for an individual carrying no gene versus an individual
carrying two recessive genes at one locus. The value of $\lambda$ is unknown but is assumed
to be small; it cancels out in the formulas. Then

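\begin{align}
& P(\text{a gene or both genes are found}) \notag\\
&= P(G_1 \text{ is found and } G_2 \text{ is not}) \notag\\
&\quad + P(G_2 \text{ is found and } G_1 \text{ is not}) \notag\\
&\quad + P(G_1 \text{ and } G_2 \text{ are found}) \notag\\
&= \{P(G_1 \text{ passes Stage I, } G_2 \text{ does not pass Stage I, } G_1 \text{ is found}) \tag{4.3}\\
&\quad + P(G_1 \text{ and } G_2 \text{ pass Stage I, } G_1 \text{ is found})\} \tag{4.4}\\
&\quad + \{P(G_2 \text{ passes Stage I, } G_1 \text{ does not pass Stage I, } G_2 \text{ is found}) \tag{4.5}\\
&\quad + P(G_1 \text{ and } G_2 \text{ pass Stage I, } G_2 \text{ is found})\} \tag{4.6}\\
&\quad + P(G_1 \text{ and } G_2 \text{ pass Stage I, } G_1 \text{ and } G_2 \text{ are found}) \tag{4.7}
\end{align}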

For a given threshold $k$ and the event {$G_1$ is found but $G_2$ is not}, the possible
relationships between $k$ and ${}_1S_1$, ${}_1S_2$, ${}_1S_3$, and ${}_1S_4$ in Stage I, and the
corresponding relationships among ${}_2S_1$, ${}_2S_2$, ${}_2S_3$, and ${}_2S_4$ in Stage II, are
given in Table 4.2. For example, for event 6, given that $G_1$ is found but $G_2$ is not,
in Stage I ${}_1S_1 > k$, ${}_1S_2 > k$, ${}_1S_3 < k$, and ${}_1S_4 > k$; then in Stage II the
maximum of ${}_2S_1$ and ${}_2S_2$ must be the largest value overall and, in particular,
larger than ${}_2S_4$.

Table 4.2. Exclusive events for the case "$G_1$ is found but $G_2$ is not." The Stage I
columns give the relationship of each statistic to the threshold $k$.

        Stage I                                  Stage II
event   ${}_1S_1$  ${}_1S_2$  ${}_1S_3$  ${}_1S_4$    the largest statistic and relationship
  1        >     >     <     <      $\max({}_2S_1, {}_2S_2)$
  2        >     <     <     <      ${}_2S_1$
  3        <     >     <     <      ${}_2S_2$
  4        >     >     >     >      $\max({}_2S_1, {}_2S_2)$, and $> \max({}_2S_3, {}_2S_4)$
  5        >     >     >     <      $\max({}_2S_1, {}_2S_2)$, and $> {}_2S_3$
  6        >     >     <     >      $\max({}_2S_1, {}_2S_2)$, and $> {}_2S_4$
  7        >     <     >     >      ${}_2S_1$, and $> \max({}_2S_3, {}_2S_4)$
  8        <     >     >     >      ${}_2S_2$, and $> \max({}_2S_3, {}_2S_4)$
  9        >     <     >     <      ${}_2S_1$, and $> {}_2S_3$
 10        >     <     <     >      ${}_2S_1$, and $> {}_2S_4$
 11        <     >     >     <      ${}_2S_2$, and $> {}_2S_3$
 12        <     >     <     >      ${}_2S_2$, and $> {}_2S_4$

For a given threshold $k$ and the event {$G_2$ is found but $G_1$ is not}, the possible
relationships between $k$ and ${}_1S_1$, ${}_1S_2$, ${}_1S_3$, and ${}_1S_4$ in Stage I, and the
corresponding relationships in Stage II, are given in Table 4.3.

For a given threshold $k$ and the event {$G_1$ and $G_2$ are found}, the possible
relationships between $k$ and ${}_1S_1$, ${}_1S_2$, ${}_1S_3$, and ${}_1S_4$ in Stage I, and the
corresponding relationships in Stage II, are given in Table 4.4.

For a given threshold $k$, the sum of probabilities (4.3) through (4.7) is

\[
\sum_{l=0}^{m-4} \sum_{i=1}^{33} \sum_{h=1}^{N_2}
P(\text{event } i, \text{ and } l \ {}_1S_j\text{'s pass Stage I, the largest statistic} = h \text{ in Stage II}),
\]

Table 4.3. Exclusive events for the case "$G_2$ is found but $G_1$ is not."

        Stage I                                  Stage II
event   ${}_1S_1$  ${}_1S_2$  ${}_1S_3$  ${}_1S_4$    the largest statistic and relationship
 13        <     <     >     >      $\max({}_2S_3, {}_2S_4)$
 14        <     <     >     <      ${}_2S_3$
 15        <     <     <     >      ${}_2S_4$
 16        >     >     >     >      $\max({}_2S_3, {}_2S_4)$, and $> \max({}_2S_1, {}_2S_2)$
 17        >     <     >     >      $\max({}_2S_3, {}_2S_4)$, and $> {}_2S_1$
 18        <     >     >     >      $\max({}_2S_3, {}_2S_4)$, and $> {}_2S_2$
 19        >     >     >     <      ${}_2S_3$, and $> \max({}_2S_1, {}_2S_2)$
 20        >     >     <     >      ${}_2S_4$, and $> \max({}_2S_1, {}_2S_2)$
 21        >     <     >     <      ${}_2S_3$, and $> {}_2S_1$
 22        >     <     <     >      ${}_2S_4$, and $> {}_2S_1$
 23        <     >     >     <      ${}_2S_3$, and $> {}_2S_2$
 24        <     >     <     >      ${}_2S_4$, and $> {}_2S_2$

Table 4.4. Exclusive events for the case "$G_1$ and $G_2$ are found."

        Stage I                                  Stage II
event   ${}_1S_1$  ${}_1S_2$  ${}_1S_3$  ${}_1S_4$    the largest statistic and relationship
 25        >     >     >     >      $\max({}_2S_3, {}_2S_4) = \max({}_2S_1, {}_2S_2)$
 26        >     <     >     >      $\max({}_2S_3, {}_2S_4) = {}_2S_1$
 27        <     >     >     >      $\max({}_2S_3, {}_2S_4) = {}_2S_2$
 28        >     >     >     <      $\max({}_2S_1, {}_2S_2) = {}_2S_3$
 29        >     >     <     >      $\max({}_2S_1, {}_2S_2) = {}_2S_4$
 30        >     <     >     <      ${}_2S_3 = {}_2S_1$
 31        >     <     <     >      ${}_2S_4 = {}_2S_1$
 32        <     >     >     <      ${}_2S_3 = {}_2S_2$
 33        <     >     <     >      ${}_2S_4 = {}_2S_2$

where  N2  =  [*=**]  M . 

For example, for event 25,

\begin{align*}
& P(\text{event 25, and } l \ {}_1S_j\text{'s pass Stage I, } \max({}_2S_3, {}_2S_4) = \max({}_2S_1, {}_2S_2) = h, \text{ and all other } l \ {}_2S_j\text{'s} < h)\\
& \quad (\text{with the assumption of independence})\\
&= P({}_1S_1 > k, \ {}_1S_2 > k, \ {}_1S_3 > k, \ {}_1S_4 > k) \, P(l \ {}_1S_j\text{'s pass Stage I})\\
& \quad \times P(\max({}_2S_3, {}_2S_4) = \max({}_2S_1, {}_2S_2) = h) \, P(\text{all other } l \ {}_2S_j\text{'s} < h),
\end{align*}

where $j \neq 1, 2, 3, 4$. For these markers the distribution of ${}_2S_j$ is
Binomial($N_2$, 0.25). Therefore, to evaluate probabilities (4.3) through (4.7), we need
the joint distribution of ${}_1S_1$, ${}_1S_2$, ${}_1S_3$, and ${}_1S_4$ conditional on both sibs
being affected. To calculate this joint distribution, we need the joint distribution
of $X_{1j}$, $X_{2j}$, $X_{3j}$, and $X_{4j}$ conditional on both sibs being affected, which is

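\begin{align}
& p_{xyuv} \tag{4.8}\\
&= P(X_{1j} = x, X_{2j} = y, X_{3j} = u, X_{4j} = v \mid \text{both affected}) \tag{4.9}\\
&= \sum_{i_1=0}^{2} \sum_{i_2=0}^{2} P(X_{1j} = x, X_{2j} = y, X_{3j} = u, X_{4j} = v, I_1 = i_1, I_2 = i_2 \mid \text{both affected}) \notag\\
&= \sum_{i_1=0}^{2} \sum_{i_2=0}^{2} \{P(X_{1j} = x, X_{2j} = y, X_{3j} = u, X_{4j} = v \mid I_1 = i_1, I_2 = i_2, \text{both affected}) \notag\\
&\qquad\qquad \times P(I_1 = i_1, I_2 = i_2 \mid \text{both affected})\} \tag{4.10}\\
&= \sum_{i_1=0}^{2} \sum_{i_2=0}^{2} P(X_{1j} = x, X_{2j} = y \mid I_1 = i_1) \, P(X_{3j} = u, X_{4j} = v \mid I_2 = i_2) \notag\\
&\qquad\qquad \times P(I_1 = i_1, I_2 = i_2 \mid \text{both affected}) \tag{4.11}
\end{align}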
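
The factor $P(\text{all other } l \ {}_2S_j\text{'s} < h)$ in the event-25 example can be evaluated directly once the null markers are taken to be independent Binomial($N_2$, 0.25) variables, as stated in the text. The following is a minimal sketch under that independence assumption; the function names are illustrative and not taken from the author's programs.

```python
from math import comb


def binom_pmf(h, n, p=0.25):
    """P(S = h) for S ~ Binomial(n, p)."""
    return comb(n, h) * p**h * (1 - p) ** (n - h)


def binom_cdf(h, n, p=0.25):
    """P(S <= h) for S ~ Binomial(n, p); zero when h is negative."""
    if h < 0:
        return 0.0
    return sum(binom_pmf(s, n, p) for s in range(min(h, n) + 1))


def prob_all_null_below(h, n2, l, p=0.25):
    """P(all l independent null Stage II statistics are strictly below h)."""
    return binom_cdf(h - 1, n2, p) ** l
```

For instance, `prob_all_null_below(h, n2, l)` shrinks as `l` grows, which is one way to see why passing fewer null markers into Stage II helps the true markers stand out.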
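Table 4.5. Conditional distribution of $X_{1j}$ and $X_{2j}$ given $I_1$.

i   x   y   $P(X_{1j} = x, X_{2j} = y \mid I_1 = i)$
2   0   0   $(1-\psi_1^2)(1-\psi_2^2)$
2   1   0   $\psi_1^2(1-\psi_2^2)$
2   0   1   $(1-\psi_1^2)\psi_2^2$
2   1   1   $\psi_1^2\psi_2^2$
1   0   0   $(1-\psi_1+\psi_1^2)(1-\psi_2+\psi_2^2)$
1   1   0   $\psi_1(1-\psi_1)(1-\psi_2+\psi_2^2)$
1   0   1   $(1-\psi_1+\psi_1^2)\psi_2(1-\psi_2)$
1   1   1   $\psi_1(1-\psi_1)\psi_2(1-\psi_2)$
0   0   0   $(2\psi_1-\psi_1^2)(2\psi_2-\psi_2^2)$
0   1   0   $(1-\psi_1)^2(2\psi_2-\psi_2^2)$
0   0   1   $(2\psi_1-\psi_1^2)(1-\psi_2)^2$
0   1   1   $(1-\psi_1)^2(1-\psi_2)^2$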

The probability $P(X_{1j} = x, X_{2j} = y \mid I_1 = i)$ in the general case, i.e., when the
gene is anywhere between the two markers, can be derived from Table 2.3 and is given
in Table 4.5, where $\psi_j = \theta_j^2 + (1-\theta_j)^2$, $j = 1, 2$, and $\theta_1$ and $\theta_2$ are the
recombination fractions between the gene and marker 1 and between the gene and
marker 2, respectively. The probability $P(X_{3j} = x, X_{4j} = y \mid I_2 = i)$ is the same
except that different $\theta$s are used.
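
The entries of Table 4.5 and the mixture in (4.11) are simple to evaluate numerically. The sketch below is illustrative only: it assumes each $X$ is a 0/1 indicator, that the two flanking markers are conditionally independent given the trait IBD score (the product form of the table entries), and it takes the joint distribution of $(I_1, I_2)$ given both sibs affected as an input dictionary `ibd_joint`, a hypothetical name.

```python
from itertools import product


def psi(theta):
    """psi = theta^2 + (1 - theta)^2 for recombination fraction theta."""
    return theta**2 + (1 - theta) ** 2


def prob_marker_one(i, theta):
    """P(X = 1 | trait IBD = i) for a marker at recombination fraction theta."""
    s = psi(theta)
    return {2: s**2, 1: s * (1 - s), 0: (1 - s) ** 2}[i]


def cond_xy(x, y, i, theta1, theta2):
    """P(X1 = x, X2 = y | I = i) for the two markers flanking one gene."""
    p1 = prob_marker_one(i, theta1)
    p2 = prob_marker_one(i, theta2)
    return (p1 if x else 1 - p1) * (p2 if y else 1 - p2)


def p_xyuv(x, y, u, v, ibd_joint, thetas):
    """Mixture (4.11): sum over (i1, i2) weighted by P(I1 = i1, I2 = i2 | both affected)."""
    t1, t2, t3, t4 = thetas
    return sum(
        cond_xy(x, y, i1, t1, t2) * cond_xy(u, v, i2, t3, t4) * weight
        for (i1, i2), weight in ibd_joint.items()
    )
```

With any `ibd_joint` whose weights sum to one, summing `p_xyuv` over the 16 values of `(x, y, u, v)` returns one, which is a convenient sanity check on the table entries.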

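The probability $P(I_1 = i_1, I_2 = i_2 \mid \text{both affected})$ is equal to

\begin{align*}
& \frac{P(I_1 = i_1, I_2 = i_2, \text{both affected})}{P(\text{both affected})}\\
&= \frac{\sum_{E_1} \sum_{E_2} \sum_{PG_1} \sum_{PG_2} P(I_1 = i_1, I_2 = i_2, E_1, E_2, PG_1, PG_2, \text{both affected})}{P(\text{both affected})}\\
&= \frac{\sum_{E_1} \sum_{E_2} \sum_{PG_1} \sum_{PG_2} \{P(\text{both affected} \mid I_1 = i_1, I_2 = i_2, E_1, E_2, PG_1, PG_2) \, P(I_1 = i_1, I_2 = i_2, E_1, E_2, PG_1, PG_2)\}}{P(\text{both affected})}\\
&= \frac{\sum_{E_1} \sum_{E_2} \sum_{PG_1} \sum_{PG_2} P(\text{both affected} \mid E_1, E_2) \, P(I_1 = i_1, I_2 = i_2, E_1, E_2, PG_1, PG_2)}{P(\text{both affected})}\\
&= \frac{\sum_{E_1} \sum_{E_2} \sum_{PG_1} \sum_{PG_2} P(\text{both affected} \mid E_1, E_2) \, P(I_1 = i_1, E_1, PG_1) \, P(I_2 = i_2, E_2, PG_2)}{P(\text{both affected})}\\
&= \frac{1}{P(\text{both affected})} \sum_{E_1} \sum_{E_2} \sum_{PG_1} \sum_{PG_2} \{P(\text{both affected} \mid E_1, E_2) \, P(I_1 = i_1 \mid E_1, PG_1) \, P(E_1 \mid PG_1) \, P(PG_1)\\
&\qquad\qquad \times P(I_2 = i_2 \mid E_2, PG_2) \, P(E_2 \mid PG_2) \, P(PG_2)\}
\end{align*}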

Because both sibs are affected, when $k_0 = 0$ the following restriction applies when
summing over $E_1$ and $E_2$: if $E_1 \neq (1,1)$ and $E_2 \neq (1,1)$, then $(E_1, E_2)$ must be
$((1,0), (0,1))$ or $((0,1), (1,0))$. $P(\text{both affected} \mid E_1, E_2)$ is given in Table
4.6. The probability $P(\text{both affected}) = \sum_{i=0}^{2} \sum_{j=0}^{2} P(I_1 = i, I_2 = j, \text{both affected})$.
The probabilities $P(I_1 = i, I_2 = j, \text{both affected})$ are given below, with "both
affected" abbreviated as BA. The remaining probabilities were given in Table 4.1.
Therefore, $P(I_1 = i_1, I_2 = i_2 \mid \text{both affected})$ can be found.

Let $p_1 = P(PG = 1)$, $p_2 = P(PG = 2)$, $p_3 = P(PG = 3)$, and $p_{456} = P(PG = 4, 5, \text{ or } 6)$. Then,

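\begin{align*}
& P(I_1 = 0, I_2 = 0, BA)\\
&= 4k_0\lambda^2 \left(\tfrac14\cdot\tfrac12\,p_2 + \tfrac{3}{16}\cdot\tfrac13\,p_3\right) \left(\tfrac{9}{16}\cdot\tfrac29\,p_3 + \tfrac14\,p_{456}\right)\\
&\quad + 2\lambda^2 \left(\tfrac14\cdot\tfrac12\,p_2 + \tfrac{3}{16}\cdot\tfrac13\,p_3\right) \left(\tfrac14\cdot\tfrac12\,p_2 + \tfrac{3}{16}\cdot\tfrac13\,p_3\right)\\
&\quad + 4k_0\lambda^2 \left(\tfrac14\cdot\tfrac12\,p_2 + \tfrac{3}{16}\cdot\tfrac13\,p_3\right)^2\\
&\quad + k_0^2\lambda^2 \left(\tfrac{9}{16}\cdot\tfrac29\,p_3 + \tfrac14\,p_{456}\right)^2\\
&= \tfrac{1}{128}\lambda^2 (2p_2 + p_3)^2\\
&\quad + \tfrac{1}{64} k_0\lambda^2 (2p_2 + p_3)(2p_2 + 3p_3 + 4p_{456})\\
&\quad + \tfrac{1}{64} k_0^2\lambda^2 (p_3 + 2p_{456})^2
\end{align*}

Table 4.6. $P(\text{both affected} \mid E_1, E_2)$ and possible trait IBD given $E_1$ and $E_2$.

$E_1$     $E_2$     $P(BA \mid E_1, E_2)$   possible trait IBD score, $(I_1, I_2)$, given $E_1$ and $E_2$
(1,1)   (0,0)   $\lambda^2$          (2,0), (2,1), (2,2)
(1,1)   (0,1)   $2\lambda^2$         (2,0), (2,1)
(1,1)   (1,0)   $2\lambda^2$         (2,0), (2,1)
(1,1)   (1,1)   $4\lambda^2$         (2,2)
(1,0)   (0,0)   $k_0\lambda^2$       (1,0), (1,1), (1,2), (0,0), (0,1), (0,2)
(1,0)   (0,1)   $\lambda^2$          (1,0), (1,1), (0,0), (0,1)
(1,0)   (1,0)   $2k_0\lambda^2$      (1,0), (1,1), (0,0), (0,1)
(1,0)   (1,1)   $2\lambda^2$         (1,2), (0,2)
(0,1)   (0,0)   $k_0\lambda^2$       (1,0), (1,1), (1,2), (0,0), (0,1), (0,2)
(0,1)   (0,1)   $2k_0\lambda^2$      (1,0), (1,1), (0,0), (0,1)
(0,1)   (1,0)   $\lambda^2$          (1,0), (1,1), (0,0), (0,1)
(0,1)   (1,1)   $2\lambda^2$         (1,2), (0,2)
(0,0)   (0,0)   $k_0^2\lambda^2$     (2,0), (2,1), (2,2), (1,0), (1,1), (1,2), (0,0), (0,1), (0,2)
(0,0)   (0,1)   $k_0\lambda^2$       (2,0), (2,1), (1,0), (1,1), (0,0), (0,1)
(0,0)   (1,0)   $k_0\lambda^2$       (2,0), (2,1), (1,0), (1,1), (0,0), (0,1)
(0,0)   (1,1)   $\lambda^2$          (2,2), (1,2), (0,2)
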


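\begin{align*}
& P(I_1 = 1, I_2 = 0, BA) = P(I_1 = 0, I_2 = 1, BA)\\
&= 2k_0\lambda^2 \left(\tfrac14\cdot\tfrac12\,p_2 + \tfrac{3}{16}\cdot\tfrac23\,p_3\right) \left(\tfrac{9}{16}\cdot\tfrac29\,p_3 + \tfrac14\,p_{456}\right)\\
&\quad + 2\lambda^2 \left(\tfrac14\cdot\tfrac12\,p_2 + \tfrac{3}{16}\cdot\tfrac23\,p_3\right) \left(\tfrac14\cdot\tfrac12\,p_2 + \tfrac{3}{16}\cdot\tfrac13\,p_3\right)\\
&\quad + 4k_0\lambda^2 \left(\tfrac14\cdot\tfrac12\,p_2 + \tfrac{3}{16}\cdot\tfrac23\,p_3\right) \left(\tfrac14\cdot\tfrac12\,p_2 + \tfrac{3}{16}\cdot\tfrac13\,p_3\right)\\
&\quad + 2k_0\lambda^2 \left(\tfrac14\cdot\tfrac12\,p_2 + \tfrac{9}{16}\cdot\tfrac49\,p_3 + \tfrac12\,p_{456}\right) \left(\tfrac14\cdot\tfrac12\,p_2 + \tfrac{3}{16}\cdot\tfrac13\,p_3\right)\\
&\quad + k_0^2\lambda^2 \left(\tfrac14\cdot\tfrac12\,p_2 + \tfrac{9}{16}\cdot\tfrac49\,p_3 + \tfrac12\,p_{456}\right) \left(\tfrac{9}{16}\cdot\tfrac29\,p_3 + \tfrac14\,p_{456}\right)\\
&= \tfrac{1}{64}\lambda^2 (p_2 + p_3)(2p_2 + p_3)\\
&\quad + \tfrac{1}{16} k_0\lambda^2 (p_2 + p_3)(p_2 + p_3 + p_{456})\\
&\quad + \tfrac{1}{64} k_0\lambda^2 (p_2 + 2p_3 + 4p_{456})(2p_2 + p_3)\\
&\quad + \tfrac{1}{64} k_0^2\lambda^2 (p_2 + 2p_3 + 4p_{456})(p_3 + 2p_{456})
\end{align*}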

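\begin{align*}
& P(I_1 = 2, I_2 = 0, BA) = P(I_1 = 0, I_2 = 2, BA)\\
&= \lambda^2 \left(p_1 + \tfrac14\,p_2 + \tfrac{1}{16}\,p_3\right) \left(\tfrac{9}{16}\cdot\tfrac29\,p_3 + \tfrac14\,p_{456}\right)\\
&\quad + 4\lambda^2 \left(p_1 + \tfrac14\,p_2 + \tfrac{1}{16}\,p_3\right) \left(\tfrac14\cdot\tfrac12\,p_2 + \tfrac{3}{16}\cdot\tfrac13\,p_3\right)\\
&\quad + k_0^2\lambda^2 \left(\tfrac14\cdot\tfrac12\,p_2 + \tfrac{9}{16}\cdot\tfrac39\,p_3 + \tfrac14\,p_{456}\right) \left(\tfrac{9}{16}\cdot\tfrac29\,p_3 + \tfrac14\,p_{456}\right)\\
&\quad + 2k_0\lambda^2 \left(\tfrac14\cdot\tfrac12\,p_2 + \tfrac{9}{16}\cdot\tfrac39\,p_3 + \tfrac14\,p_{456}\right) \left(\tfrac14\cdot\tfrac12\,p_2 + \tfrac{3}{16}\cdot\tfrac13\,p_3\right)\\
&= \tfrac{1}{128}\lambda^2 (16p_1 + 4p_2 + p_3)(4p_2 + 3p_3 + 2p_{456})\\
&\quad + \tfrac{1}{128} k_0\lambda^2 (2p_2 + 3p_3 + 4p_{456})(2p_2 + p_3)\\
&\quad + \tfrac{1}{128} k_0^2\lambda^2 (2p_2 + 3p_3 + 4p_{456})(p_3 + 2p_{456})
\end{align*}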


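\begin{align*}
& P(I_1 = 1, I_2 = 1, BA)\\
&= 4k_0\lambda^2 \left(\tfrac14\cdot\tfrac12\,p_2 + \tfrac{3}{16}\cdot\tfrac23\,p_3\right) \left(\tfrac14\cdot\tfrac12\,p_2 + \tfrac{9}{16}\cdot\tfrac49\,p_3 + \tfrac12\,p_{456}\right)\\
&\quad + 2\lambda^2 \left(\tfrac14\cdot\tfrac12\,p_2 + \tfrac{3}{16}\cdot\tfrac23\,p_3\right)^2\\
&\quad + 4k_0\lambda^2 \left(\tfrac14\cdot\tfrac12\,p_2 + \tfrac{3}{16}\cdot\tfrac23\,p_3\right)^2\\
&\quad + k_0^2\lambda^2 \left(\tfrac14\cdot\tfrac12\,p_2 + \tfrac{9}{16}\cdot\tfrac49\,p_3 + \tfrac12\,p_{456}\right)^2\\
&= \tfrac{1}{32}\lambda^2 (p_2 + p_3)^2\\
&\quad + \tfrac{1}{16} k_0\lambda^2 (p_2 + p_3)(2p_2 + 3p_3 + 4p_{456})\\
&\quad + \tfrac{1}{64} k_0^2\lambda^2 (p_2 + 2p_3 + 4p_{456})^2
\end{align*}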


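\begin{align*}
& P(I_1 = 2, I_2 = 1, BA) = P(I_1 = 1, I_2 = 2, BA)\\
&= \lambda^2 \left(p_1 + \tfrac14\,p_2 + \tfrac{1}{16}\,p_3\right) \left(\tfrac14\cdot\tfrac12\,p_2 + \tfrac{9}{16}\cdot\tfrac49\,p_3 + \tfrac12\,p_{456}\right)\\
&\quad + 4\lambda^2 \left(p_1 + \tfrac14\,p_2 + \tfrac{1}{16}\,p_3\right) \left(\tfrac14\cdot\tfrac12\,p_2 + \tfrac{3}{16}\cdot\tfrac23\,p_3\right)\\
&\quad + k_0^2\lambda^2 \left(\tfrac14\cdot\tfrac12\,p_2 + \tfrac{9}{16}\cdot\tfrac39\,p_3 + \tfrac14\,p_{456}\right) \left(\tfrac14\cdot\tfrac12\,p_2 + \tfrac{9}{16}\cdot\tfrac49\,p_3 + \tfrac12\,p_{456}\right)\\
&\quad + 2k_0\lambda^2 \left(\tfrac14\cdot\tfrac12\,p_2 + \tfrac{9}{16}\cdot\tfrac39\,p_3 + \tfrac14\,p_{456}\right) \left(\tfrac14\cdot\tfrac12\,p_2 + \tfrac{3}{16}\cdot\tfrac23\,p_3\right)\\
&= \tfrac{1}{128}\lambda^2 (16p_1 + 4p_2 + p_3)(5p_2 + 6p_3 + 4p_{456})\\
&\quad + \tfrac{1}{64} k_0\lambda^2 (2p_2 + 3p_3 + 4p_{456})(p_2 + p_3)\\
&\quad + \tfrac{1}{128} k_0^2\lambda^2 (2p_2 + 3p_3 + 4p_{456})(p_2 + 2p_3 + 4p_{456})
\end{align*}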

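\begin{align*}
& P(I_1 = 2, I_2 = 2, BA)\\
&= 2\lambda^2 \left(p_1 + \tfrac14\,p_2 + \tfrac{1}{16}\,p_3\right) \left(\tfrac14\cdot\tfrac12\,p_2 + \tfrac{9}{16}\cdot\tfrac39\,p_3 + \tfrac14\,p_{456}\right)\\
&\quad + 4\lambda^2 \left(p_1 + \tfrac14\,p_2 + \tfrac{1}{16}\,p_3\right)^2\\
&\quad + k_0^2\lambda^2 \left(\tfrac14\cdot\tfrac12\,p_2 + \tfrac{9}{16}\cdot\tfrac39\,p_3 + \tfrac14\,p_{456}\right)^2\\
&= \tfrac{1}{128}\lambda^2 (16p_1 + 4p_2 + p_3)(32p_1 + 10p_2 + 5p_3 + 4p_{456})\\
&\quad + \tfrac{1}{256} k_0^2\lambda^2 (2p_2 + 3p_3 + 4p_{456})^2
\end{align*}
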
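
The joint probabilities $P(I_1 = i, I_2 = j, BA)$ can also be assembled numerically from the single-locus factors and the penetrances of Table 4.6, which gives a check on the closed forms. The sketch below is an illustration, not the author's program: the function names are hypothetical, the single-locus weights are those appearing in the expanded expressions (with $E = (1,1)$ contributing only to IBD score 2, as in Table 4.6), and exact fractions are used so comparisons are exact.

```python
from fractions import Fraction as Fr
from itertools import product


def locus_terms(p1, p2, p3, p456):
    """Single-locus probabilities P(E, I = i), keyed by (E, i)."""
    terms = {((1, 1), 2): p1 + Fr(1, 4) * p2 + Fr(1, 16) * p3}
    for e in ((1, 0), (0, 1)):
        terms[(e, 0)] = Fr(1, 4) * Fr(1, 2) * p2 + Fr(3, 16) * Fr(1, 3) * p3
        terms[(e, 1)] = Fr(1, 4) * Fr(1, 2) * p2 + Fr(3, 16) * Fr(2, 3) * p3
    terms[((0, 0), 0)] = Fr(9, 16) * Fr(2, 9) * p3 + Fr(1, 4) * p456
    terms[((0, 0), 1)] = Fr(1, 4) * Fr(1, 2) * p2 + Fr(9, 16) * Fr(4, 9) * p3 + Fr(1, 2) * p456
    terms[((0, 0), 2)] = Fr(1, 4) * Fr(1, 2) * p2 + Fr(9, 16) * Fr(3, 9) * p3 + Fr(1, 4) * p456
    return terms


def penetrance(n_loci, lam, k0):
    """Penetrance for a sib homozygous recessive at n_loci of the two loci."""
    return {0: k0 * lam, 1: lam, 2: 2 * lam}[n_loci]


def prob_both_affected(e1, e2, lam, k0):
    """P(both affected | E1, E2), as in Table 4.6."""
    return penetrance(e1[0] + e2[0], lam, k0) * penetrance(e1[1] + e2[1], lam, k0)


def joint_ibd_and_affected(p1, p2, p3, p456, lam, k0):
    """P(I1 = i, I2 = j, both affected) for all i, j in {0, 1, 2}."""
    terms = locus_terms(p1, p2, p3, p456)
    joint = {(i, j): Fr(0) for i in range(3) for j in range(3)}
    for ((e1, i), a), ((e2, j), b) in product(terms.items(), repeat=2):
        joint[(i, j)] += prob_both_affected(e1, e2, lam, k0) * a * b
    return joint


def conditional_ibd(joint):
    """P(I1 = i, I2 = j | both affected); lambda cancels in the ratio."""
    total = sum(joint.values())
    return {key: value / total for key, value in joint.items()}
```

Because every term carries the factor $\lambda^2$, the output of `conditional_ibd` does not depend on the value passed for `lam`, which matches the remark that $\lambda$ cancels out.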


Therefore, the joint distribution of $X_{1j}$, $X_{2j}$, $X_{3j}$, and $X_{4j}$ conditional on
both sibs being affected, $p_{xyuv}$, can be found. Even though $p_{xyuv}$ is known, it takes
too much time to compute the exact joint distribution of ${}_1S_1$, ${}_1S_2$, ${}_1S_3$, and
${}_1S_4$ conditional on both sibs being affected using the method of Chapter 3 for the
joint distribution of ${}_1S_0$ and ${}_1S_1$. Therefore, Monte Carlo integration was used
to find the conditional joint distribution of ${}_1S_1$, ${}_1S_2$, ${}_1S_3$, and ${}_1S_4$.

4.3.3     Simulation  under  More  Realistic  Assumptions 

For the same reason as in the simple disease case, we also performed simulations
for an independent model and a Markov chain model without combining Stage I and
Stage II data. The total length of the genome is 3300 cM, divided into 22
chromosomes. The number of genes is two. The combinations of parameters are:
resource $N$ = 2000, 5000, 7500, 10,000, and 20,000; number of ASPs $n$ = 50, 100, and
150; population frequency (P.F.) of the disease gene = 0.01, 0.2, and 0.5; relative
risk $k_0$ of being affected for carrying no gene vs. carrying two recessive genes at
one locus = 0, 0.05, 0.1, and 1.0. The map densities used are 5 cM, 10 cM, and 20 cM.
The threshold $k$ ranges from 1 to $n_1$, and $n_1$ from 5 to $n - 5$ in increments of 5. The
simulations were conducted in the same way as those in Chapter 3, except that we
simulated 22 pairs of chromosomes instead of one and two genes instead of one. The
details of the simulation are as follows.

1.  Read in the parameters: resource $N$, P.F., $k_0$, number of ASPs $n$, the map
density, and $n_1$ and $k$ of the design.

2.  Generate a gene location from the uniform distribution on (0, 3300/22) twice,
the first for the gene on chromosome 1 and the second for the gene on
chromosome 2. Then follow the rest of step 2 on page 61. For those chromosomes
that were not chosen, set $I = 0$. Generate the joint trait IBD for the two loci.

3.  Follow steps 3 and 4 on page 61 for every chromosome.

4.  Repeat steps 2 and 3 $n_1$ times and then calculate the ${}_1S_i$.

5.  Check whether any of ${}_1S_1$, ${}_1S_2$, ${}_1S_3$, and ${}_1S_4$ is greater than $k$. If yes,
calculate $N_2$ according to the resource constraint, choose the markers whose
${}_1S_i > k$ for the Stage II study, and go to the next step. If not, record the
detection as a failure and go back to step 2.

6.  The $X_{ij}$s in Stage II are generated in the same way as those in Stage I, except
that only the chosen markers are used and the procedure is repeated $N_2$ times
instead of $n_1$ times.

7.  Check whether ${}_2S_1$ or ${}_2S_2$, or ${}_2S_3$ or ${}_2S_4$, according to which of them
passed Stage I, is the largest among all ${}_2S_j$s in Stage II. If any one of them is
the largest, it is a success; otherwise it is a failure. Go back to step 2 until
enough simulations have been done.

The programs were compiled and run in the same computing environment mentioned
in Chapter 3.

4.3.4     Results 

Based on analytic computation and simulation, the design with the highest
probability of finding the correct marker(s) was identified for a disease with two
unlinked recessive genes with additive penetrance. These optimal designs are given
in Table 4.7.

In the tables, the P.F. column shows the population frequency of the disease
gene; the $k_0$ column, the relative risk; the map column, the map density; the $k$
column, the threshold for passing Stage I into the Stage II study; the $n_1$ column,
the number of ASPs used in Stage I; and the F2 column, the probability of locating
the right marker by the best two-stage design, obtained by the analytic formula. The
I.1, I.2, and I.3 columns show the probabilities, obtained by simulation under the
independence assumption, of finding the markers next to the genes, within two
intervals, and within three intervals, respectively, by the two-stage design. The
M.1, M.2, and M.3 columns show the probabilities corresponding to I.1, I.2, and I.3
obtained by simulation under the Markov chain assumption without combining first-
and second-stage data. The F1 column gives the probability for the best one-stage
design, i.e., with optimal $m$ and $n$ subject to $mn < N$, obtained by the analytic
formula. The F2-F1 column shows the increase in probability of the two-stage
design over the one-stage design.

4.4     Discussion 

The Monte Carlo results indicated that the relative errors between the
probabilities calculated from the formulas and those from the Markov chain
simulations were almost negligible when using a 20 cM map, small for a 10 cM map,
and larger for a 5 cM map. This is most likely because, under a Markov chain model
with a denser map, more samples are needed to separate the markers adjacent to the
genes from their neighboring markers. If we examine the I.2, I.3, M.2, and M.3
columns, we find that with a denser map the design can still narrow the gene
location to an area wider than the interval between two adjacent markers. For
example, for $N$ = 10,000, ASP = 50, P.F. = 0.01, and $k_0 = 0$, M.1 is only 0.88 but M.2
is 0.97. This shows that if a denser map is used, more ASPs might be needed to
pinpoint the gene location.

The population frequency of the disease genes and the relative risk $k_0$ of carrying
no gene versus carrying two recessive genes at one locus interact in their effect on
the probability of finding the correct markers. For a given resource, number of
ASPs, and population frequency, the probability of finding the correct markers is a
decreasing function of the relative risk $k_0$. As the population frequency increases,
this probability decreases more slowly in the relative risk. This is an intuitive
result: when the population frequency of the disease gene is low and the relative
risk of being affected for carrying no gene vs. carrying two recessive genes at one
locus is high, there will be many phenocopy cases in the sample. Hence, the
probability of finding the correct markers will be small. On the other hand, when
the population frequency of the disease gene is high and the relative risk is low,
there will be only a small number of phenocopy cases in the sample. Therefore, the
probability of finding the correct markers will be large.
The two-stage approach once again provides a large improvement in many cases. For
example, when $N$ = 2,000, ASP = 100, population frequency of the disease gene = 0.5,
and relative risk = 0.1, the probability of finding at least one of the two disease
gene loci is boosted from 30% to 80% by a two-stage design. Improvements of over
40% are marked with * in Table 4.7. Once again, cases in which the two-stage
approach performs no better, or worse, than the one-stage approach are also
observed; all of the worse cases occurred when all the ASPs could be typed in a
one-stage design.


CHAPTER  5 
CONCLUDING  REMARKS 


Gene hunting is a difficult task that usually requires a vast amount of resources.
The two-stage approach is one way to reduce cost. This dissertation gives optimal
designs that utilize resources efficiently.

A researcher who wants to use the results in this dissertation must first
postulate, with information from other research or other considerations, whether
the disease is caused by a single gene or multiple genes, the percentage of
phenocopies, the population frequency of the disease gene, and the relative risk. If
funding and the sample (ASPs) have already been obtained, the researcher should
check the tables to find the combination of genetic model (dominant or recessive),
funding condition ($N$), sample size ($n$), and heterogeneity (or population frequency
and relative risk) that matches his/her situation, and then use the design in the
tables. If the funding and sample have not yet been obtained, the researcher should
consult the tables and find the design that would fit his/her possible funding and
sample conditions. For example, if a researcher believes that the disease is caused
by a dominant gene, expects that 100 ASPs will be available for typing, and can
only acquire enough funding to type 5,000 marker loci for ASPs, then using 125
markers and 20 ASPs in Stage I and choosing 17 markers for the Stage II study would
be a design that provides an approximately 75% probability of finding the right
location. On the other hand, if he/she cannot collect more than 75 ASPs for the
study and funding for typing 10,000 markers is available, then using 275 markers
and 15 ASPs in Stage I and choosing 59 markers for Stage II would also provide an
approximately 75% probability of finding the right location. If a 75% probability is
not good enough, then the researcher has to find extra funding or more ASPs for the
study, since no other two-stage design can do better than the design obtained from
the tables. If the combination of the parameters is outside the range of the tables
given in this dissertation, then the researcher should obtain the programs from
this author to calculate the best design.

In the future, we plan to extend the method presented in this dissertation to
searching for multiple complex disease genes (more than two), simple quantitative
trait loci (QTL), and complex QTLs, and to incorporate Risch's risk ratio method.
Extensions to incorporate different genetic models should also be included.
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